Wireshark-dev: Re: [Wireshark-dev] Bzip2 support

From: ronnie sahlberg <ronniesahlberg@xxxxxxxxx>
Date: Thu, 27 Jun 2019 09:36:08 +1000
On Thu, Jun 27, 2019 at 7:17 AM Guy Harris <guy@xxxxxxxxxxxx> wrote:
>
> On Jun 26, 2019, at 2:03 PM, Jaap Keuter <jaap.keuter@xxxxxxxxx> wrote:
>
> > On 26 Jun 2019, at 19:41, Guy Harris <guy@xxxxxxxxxxxx> wrote:
> >
> >> It could probably be done (note that for decompressing capture files that would require the ability to do random access I/O,
> >
> > It (http://sourceware.org/bzip2/manual/manual.html#limits) now says: "Further ahead, it would be nice to be able to do random access into files. This will require some careful design of compressed file formats."
>
> gzip format wasn't carefully designed for that, either, but it can be - and has been - made to work.  It requires storing dictionary state.

Yep. BGZIP, and the library you can link against, does this. I even built
a FUSE filesystem to transparently "unzip" these kinds of files.

What BGZIP does is restart with a new dictionary every ~64 KB and
store an index in a separate file.
The bgzip file itself is compatible with gzip, so you can uncompress it
using vanilla gzip,
but in order to do random reads/seeks in the file you need the index file.
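
To make the idea concrete, here is a rough sketch in Python. It is not the
real BGZIP on-disk format or its library API, just the same principle:
compress each ~64 KB block as an independent gzip member, and keep an index
of where each member starts. The names compress_blocks and read_at are made
up for illustration, and the index is kept in memory rather than in a
separate file.

```python
import gzip

BLOCK = 64 * 1024  # uncompressed bytes per independently compressed block


def compress_blocks(data, block=BLOCK):
    """Compress data as concatenated gzip members.

    Returns (blob, index), where index[i] is the offset in blob at which
    the compressed form of block i starts.
    """
    out = bytearray()
    index = []
    for start in range(0, len(data), block):
        index.append(len(out))
        # Each call starts from scratch, so no block depends on an earlier one.
        out += gzip.compress(data[start:start + block])
    return bytes(out), index


def read_at(blob, index, offset, length, block=BLOCK):
    """Return `length` uncompressed bytes starting at `offset`.

    Only the blocks that overlap the requested range are decompressed.
    """
    result = bytearray()
    first = offset // block
    last = min((offset + length - 1) // block, len(index) - 1)
    for i in range(first, last + 1):
        end = index[i + 1] if i + 1 < len(index) else len(blob)
        result += gzip.decompress(blob[index[i]:end])
    skip = offset - first * block
    return bytes(result[skip:skip + length])
```

Because the output is just a series of gzip members back to back, a plain
gzip decompressor can still read the whole thing sequentially; the index is
only needed to jump into the middle.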

It works quite well.
The problem I found is that when you restart with a new dictionary
every ~64 KB there is not much for the compression engine to work with,
so the compression ratio is usually (in my cases) quite poor compared to
normal gzip.
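
If you want to measure the cost on your own files, something like the
following sketch will do. The function name whole_vs_blocks is made up, and
it uses Python's gzip module rather than BGZIP itself, so treat the numbers
as an approximation of the effect, not of what bgzip produces.

```python
import gzip


def whole_vs_blocks(data, block=64 * 1024):
    """Return (whole, blocked) compressed sizes in bytes.

    whole:   data compressed as one gzip stream.
    blocked: data split into `block`-sized pieces, each compressed
             independently, so every piece starts with an empty dictionary.
    """
    whole = len(gzip.compress(data))
    blocked = sum(
        len(gzip.compress(data[i:i + block]))
        for i in range(0, len(data), block)
    )
    return whole, blocked
```

Feed it the raw bytes of a capture file and compare the two numbers. How
large the gap is depends entirely on the data.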