Wireshark-dev: [Wireshark-dev] tvb_get_string_enc() doesn't always return valid UTF-8

From: Martin Kaiser <lists@xxxxxxxxx>
Date: Mon, 20 Jan 2014 18:22:37 +0100
Hi,

If I have a tvbuff whose first byte is 0x86 and I call

a = tvb_get_string_enc(tvb, 0, ENC_ASCII)
proto_tree_add_string(..., a);

I can trigger the DISSECTOR_ASSERT, since a is not a valid UTF-8 string.
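
To illustrate why a lone 0x86 byte fails the check, here is a minimal sketch of a structural UTF-8 test in plain C. The function name is_utf8_simple is made up for this example and is not a Wireshark or GLib API. It is deliberately simplified and does not reject overlong encodings or surrogates.

```c
#include <stddef.h>

/* Hypothetical helper for illustration only.
 * Returns 1 if buf[0..len) is structurally valid UTF-8, 0 otherwise.
 * Simplified: overlong forms and surrogates are not rejected. */
int is_utf8_simple(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = buf[i];
        size_t need; /* number of continuation bytes expected */

        if (c < 0x80)
            need = 0;               /* plain ASCII */
        else if ((c & 0xE0) == 0xC0)
            need = 1;               /* 110xxxxx */
        else if ((c & 0xF0) == 0xE0)
            need = 2;               /* 1110xxxx */
        else if ((c & 0xF8) == 0xF0)
            need = 3;               /* 11110xxx */
        else
            return 0;               /* 0x80..0xBF (e.g. 0x86) cannot start a sequence */

        i++;
        while (need > 0) {
            /* every continuation byte must look like 10xxxxxx */
            if (i >= len || (buf[i] & 0xC0) != 0x80)
                return 0;
            i++;
            need--;
        }
    }
    return 1;
}
```

0x86 has the bit pattern 10000110, i.e. it is a continuation byte with no lead byte in front of it, so the function returns 0 for a buffer that starts with it.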

Comments in the code suggest that tvb_get_string() should replace
chars >= 0x80 with the Unicode replacement char (U+FFFD), which is three
bytes long in UTF-8. This would look like

guint8 *
tvb_get_string(wmem_allocator_t *scope, tvbuff_t *tvb, gint offset, gint length)
{
        wmem_strbuf_t *str;

        tvb_ensure_bytes_exist(tvb, offset, length);
        str = wmem_strbuf_new(scope, "");

        while (length > 0) {
                guint8 ch = tvb_get_guint8(tvb, offset);

                if (ch < 0x80)
                        wmem_strbuf_append_c(str, ch);
                else {
                        wmem_strbuf_append_unichar(str, UNREPL);
                }
                offset++;
                length--;
        }
        /* No explicit '\0' needed: wmem_strbuf keeps its buffer NUL-terminated. */

        return (guint8 *) wmem_strbuf_get_str(str);
}
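
For experimenting outside of Wireshark, the same replacement logic can be sketched in plain C without tvbuff or wmem. The name ascii_replace_8bit is hypothetical, and the use of malloc() instead of a wmem scope is an assumption made only to keep the example standalone.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical standalone analogue of the function above, not a Wireshark API.
 * Copies len bytes from in, replacing every byte >= 0x80 with U+FFFD
 * encoded as UTF-8 (EF BF BD). Returns a malloc'ed, NUL-terminated
 * string that the caller must free, or NULL on allocation failure. */
char *ascii_replace_8bit(const unsigned char *in, size_t len)
{
    /* worst case: every input byte becomes three output bytes */
    char *out = malloc(3 * len + 1);
    size_t o = 0;

    if (out == NULL)
        return NULL;

    for (size_t i = 0; i < len; i++) {
        if (in[i] < 0x80) {
            out[o++] = (char)in[i];
        } else {
            out[o++] = (char)0xEF;
            out[o++] = (char)0xBF;
            out[o++] = (char)0xBD;
        }
    }
    out[o] = '\0';
    return out;
}
```

A two-byte input of 0x86 followed by 'A' yields a string of four bytes plus the terminating NUL, so the output byte count no longer matches the input length.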


The resulting string would still contain len+1 chars (including the
terminating NUL) but not necessarily len+1 bytes, since each replaced
byte grows into a multi-byte sequence. Would that be a problem, i.e. is
it OK to do something like

b = tvb_get_string(NULL, tvb, offset, len_b);
copy_of_b = g_malloc(len_b+1);
memcpy(copy_of_b, b, len_b+1);

?
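
As a sketch of what could go wrong with the pattern above, the two hypothetical helpers below contrast copying by the source length with copying by the actual byte length of the string. Neither is a Wireshark function; in GLib-based code, g_strdup() already provides the second behaviour.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical: mimics the pattern in question. Copies len_b+1 bytes,
 * which is too few if b contains multi-byte replacement characters,
 * so the copy may end up without a terminating NUL. */
char *copy_by_source_len(const char *b, size_t len_b)
{
    char *copy = malloc(len_b + 1);

    if (copy != NULL)
        memcpy(copy, b, len_b + 1);
    return copy;
}

/* Hypothetical: copies according to the real byte length of the string,
 * so the terminating NUL is always included. */
char *copy_by_strlen(const char *b)
{
    size_t n = strlen(b) + 1;
    char *copy = malloc(n);

    if (copy != NULL)
        memcpy(copy, b, n);
    return copy;
}
```

With b holding EF BF BD 'A' (the result for a source of 0x86 'A', so len_b is 2), copy_by_source_len() copies only EF BF BD and leaves the buffer unterminated, while copy_by_strlen() copies all five bytes.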

If that is supposed to work, we'd need a separate function that gets the
string and replaces 8-bit chars.

Thoughts?

   Martin