Wireshark-dev: Re: [Wireshark-dev] Regenerating packet-parlay.c

From: Luke Mewburn <luke@xxxxxxxxxxx>
Date: Mon, 4 May 2020 19:16:40 +1000
On 20-05-04 10:55, Jaap Keuter wrote:
  | Hi Luke,
  | 
  | > Yes, I regenerated the code using that patch to
  | > tools/wireshark_gen.py, and it builds fine across a couple of
  | > platforms.
  | 
  | Yes, I could build it too, so the generated code itself was okay. My
  | quest is to regenerate the _same_ source files.

I don't think that's going to be possible to regenerate the same as the
currently committed file, because it was generated with wireshark_gen.py
whose implementation that did not guarantee reproducible output.

I think that unless you have access to the exact system setup that
was used to generate packet-parlay.c that you won't be able reproduce
the files as-is with the existing tool anyway.


  | > I wrote a simple wrapper tool idl-regen (attached) to assist each
  | > regen from the idl, used with something like:
  | > 	$ ./idl-regen ../epan/dissectors/
  | > 	mmccs.idl:120: Warning: Identifier 'EventType' clashes with CORBA 3 keyword 'eventtype'
  | > 	policy_data.idl:119: Warning: Anonymous sequences for recursive unions are deprecated. Use a forward declaration instead.
  | > 	policy_data.idl:123: Warning: Anonymous sequences for recursive unions are deprecated. Use a forward declaration instead.
  | > 
  | > I get the same .c files generated from either:
  | > - CentOS 7,  python 2.7.5, omniidl 4.2.0
  | > - Fedora 31, python 3.7.6, omniidl 4.2.3
  | > 
  | 
  | So the code generation is identical between Python 2 and 3, that's a
  | good sign.

Yes.


  | > Three files get changes from git master: - packet-gias.c -
  | > packet-parlay.c - packet-tango.c
  | > 
  | > The generated code for packet-gias.c and packet-tango.c differ to
  | > git master because of the following fix to idl2wrs that doesn't
  | > appear to have been rerun to regenerate those files: commit
  | > 443df93896 Author: Yannik Enss <Yannik.Enss@xxxxxxxxxxxxxxxxx>
  | > Date:   2019-05-29 14:20:42 +0200
  | > 
  | > 	idl2wrs: fix 'undeclared identifier' error
  | > 
  | > 	the 'x_octetx' variables were removed a few years back, replace
  | > 	them with get_CDR_xxx()
  | > 	I8cf3410d8a152c834e7019f7d1d80de3798530c3
  | > 
  | 
  | Good find. I've applied the anti-patch for this and reran code
  | generation. Now the same source files come out (with the exception
  | below). I've also build with these files without issue, so can
  | anyone tell me what this commit achieves? It references something 'a
  | few years back' which doesn't help either. If nothing I'm inclined
  | to apply this anti-patch.
  | 
  | Gerald, you seem to have 'rubber stamped' this patch series about a
  | year ago. I know it's much to ask, but I can't find anything on
  | this, no bug report, no wireshark-dev posts, can you tell what this
  | was for and why the repository code is not the same as the code
  | generated?

I actually think the change is probably fine to leave, but I'll defer
to others.


  | > packet-parlay.c gets that above change as well as many other
  | > changes to generated code for /* User exception filters */,
  | > mostly in the ordering of the declarations and the list of
  | > array entries added to proto_register_giop_parlay() variable
  | > 	static hf_register_info hf[] 
  | > 
  | 
  | Yes, this was the trigger for my initial question. Why is that ordering
  | different than what's committed in the repo. My position is that
  | generated files must always be possible to be regenerated.

I completely agree that generated files must be able to be reliably
regenerated from the same inputs and tools without change.

As we both know, the issue is that the existing tool did not guarantee
reproducible output, due to the use of a python dict which did not
guarantee a particular ordering before python 3.6.



  | > As far as I can tell, the new packet-parlay.c has the same
  | > number of lines, just in different order.
  | 
  | That was my impression too, but there are just too many to check by hand.
  | 
  | 
  | > and compared the regenerated file with that, they both still have same
  | > size and number of lines, and if I sort the lines in both the files are
  | > identical. (Of course, that doesn't compile... :)
  | > 
  | 
  | Oh, that is indeed a nice trick to see if at least the same lines are in the
  | generated code files. I'm going to use that.

It wasn't super robust, but it was just part of a broader group of
tests I ran.


  | > I tried doing some comparison of the generated object files but
  | > they differed too much, probably because of the reordering of the
  | > source lines, even though the hf[] array ends up with the same
  | > content in different order.
  | > 
  | 
  | Okay, so your statement is that we just have to accept that the
  | ordering of the the currently generated source file is going to be
  | different from the committed source file. Do you think we still
  | should introduce the ordering statements?

I think we should use the OrderedDict change I proposed, which means
that the ordering will be in order the data structure is populated.
This matches all of the other data structures in wireshark_gen.py
and wireshark_be.py (lists).

I don't think we need to change the code to sort all the different
identifiers and generated code, however. That will result in a bigger
change to all the source files generated from IDL.


  | > I don't know how much more verification is required, however.  It
  | > looks like in the past there have been changes to wireshark_gen.py
  | > and then regen of packet-parlay.c that reordered a lot of the
  | > content.  For example, see commit c99bee9b5d on 2019-06-05.
  | > 
  | 
  | Well, for me now reproducibility is the main goal. With your 'line
  | sorting trick' we can at least confidently state that the actual
  | items are not changed, and have to assume that the code generation
  | is solid enough to create equivalent code. I'm okay with that.

There may be some more testing that could be done with processing the
output of objdump, but I'm not sure it's worth it.

As mentioned previously, I suspect that this level of verification
of regenerated files has not been applied to the previously
committed versions of these regenerated files based on reviewing
their commit history.


Another other test, of course, is to process some parlay captures with
tshark with a full decode, and compare the output :)


cheers,
Luke.

Attachment: pgpw7jVYCDa5O.pgp
Description: PGP signature