Wireshark · Wireshark-dev: Re: [Wireshark-dev] tvb_get_string_enc() doesn't always return valid UTF-8

Wireshark-dev: Re: [Wireshark-dev] tvb_get_string_enc() doesn't always return valid UTF-8

Date: Mon, 20 Jan 2014 13:23:20 -0500

There was a bug where Guy and I discussed strings in depth (though I
can't find it at the moment).

I think we'd agreed that the right thing to do is to convert most of
our string functions to handle and return counted strings
(wmem_strbuf_t or something) and then do the replacement as you
suggest. There are several other outstanding encoding issues
(especially around embedded NULLs) where string length cannot be
reliably managed without explicitly counting it.

Unfortunately it's a relatively large API change, but I think it's the
right thing going forward, especially since we already use a
wmem_strbuf_t in most of the _get_string functions already (we just
don't return it).

Evan

On Mon, Jan 20, 2014 at 12:22 PM, Martin Kaiser <lists@xxxxxxxxx> wrote:
> Hi,
>
> if I have a tvbuff that starts with 0x86 and I call
>
> a = tvb_get_string_enc(tvb, 0, ENC_ASCII)
> proto_tree_add_string(..., a);
>
> I can trigger the DISSECTOR_ASSERT since a is not a valid unicode string.
>
> Comments in the code suggest that tvb_get_string() should replace
> chars>=0x80 with the unicode replacement char, which is two bytes long.
> This would look like
>
> guint8 *
> tvb_get_string(wmem_allocator_t *scope, tvbuff_t *tvb, gint offset, gint length)
> {
>         wmem_strbuf_t *str;
>
>         tvb_ensure_bytes_exist(tvb, offset, length);
>         str = wmem_strbuf_new(scope, "");
>
>         while (length > 0) {
>                 guint8 ch = tvb_get_guint8(tvb, offset);
>
>                 if (ch < 0x80)
>                         wmem_strbuf_append_c(str, ch);
>                 else {
>                         wmem_strbuf_append_unichar(str, UNREPL);
>                 }
>                 offset++;
>                 length--;
>         }
>         wmem_strbuf_append_c(str, '\0');
>
>         return (guint8 *) wmem_strbuf_get_str(str);
> }
>
>
> The resulting string would still contain len+1 chars but not necessarily
> len+1 bytes. Would that be a problem, i.e. is it ok to do sth like
>
> b = tvb_get_string(NULL, tvb, offset, len_b);
> copy_of_b = g_malloc(len_b+1);
> memcpy(copy_of_b, b, len_b+1);
>
> ?
>
> If that should work, we'd need a separate function for get string &
> replace 8bit chars.
>
> Thoughts?
>
>    Martin
> ___________________________________________________________________________
> Sent via:    Wireshark-dev mailing list <wireshark-dev@xxxxxxxxxxxxx>
> Archives:    http://www.wireshark.org/lists/wireshark-dev
> Unsubscribe: https://wireshark.org/mailman/options/wireshark-dev
>              mailto:wireshark-dev-request@xxxxxxxxxxxxx?subject=unsubscribe

Follow-Ups:
- Re: [Wireshark-dev] tvb_get_string_enc() doesn't always return valid UTF-8
  - From: Martin Kaiser

References:
- [Wireshark-dev] tvb_get_string_enc() doesn't always return valid UTF-8
  - From: Martin Kaiser

Prev by Date: [Wireshark-dev] tvb_get_string_enc() doesn't always return valid UTF-8
Next by Date: Re: [Wireshark-dev] tvb_get_string_enc() doesn't always return valid UTF-8
Previous by thread: [Wireshark-dev] tvb_get_string_enc() doesn't always return valid UTF-8
Next by thread: Re: [Wireshark-dev] tvb_get_string_enc() doesn't always return valid UTF-8
Index(es):
- Date
- Thread