Wireshark-bugs: [Wireshark-bugs] [Bug 10681] UTF-8 replacement characters in FT_STRINGs are esca

Date: Thu, 14 Apr 2016 02:51:08 +0000

Comment # 8 on bug 10681 from
(In reply to Jeff Morriss from comment #7)
> Hmmm...  Why isn't tvb_get_string_enc() returning valid UTF8 (like it says
> it will)?

If it says that, it lies.

It calls tvb_get_utf_8_string() to extract the string, and
tvb_get_utf_8_string() is:

/*
 * Given a wmem scope, a tvbuff, an offset, and a length, treat the string
 * of bytes referred to by the tvbuff, the offset, and the length as a UTF-8
 * string, and return a pointer to that string, allocated using the wmem scope.
 *
 * XXX - should map invalid UTF-8 sequences to UNREPL.
 */
static guint8 *
tvb_get_utf_8_string(wmem_allocator_t *scope, tvbuff_t *tvb, const gint offset,
                     const gint length)
{
    guint8 *strbuf;

    /* make sure length = -1 fails */
    tvb_ensure_bytes_exist(tvb, offset, length);
    strbuf = (guint8 *)wmem_alloc(scope, length + 1);
    tvb_memcpy(tvb, strbuf, offset, length);
    strbuf[length] = '\0';
    return strbuf;
}

which does *no* validation of the string whatsoever.
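
For illustration only, here is a rough sketch of the kind of mapping the XXX comment asks for. It is standalone C, not Wireshark code: sanitize_utf8() and utf8_seq_len() are hypothetical names, it uses plain malloc() rather than a wmem scope, and it assumes the simplest policy of emitting one U+FFFD (bytes EF BF BD) per offending byte.

```c
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not a Wireshark function): return the length of the
 * well-formed UTF-8 sequence starting at p, of which `avail` bytes are
 * readable, or 0 if the bytes there are not well-formed UTF-8.
 */
static size_t
utf8_seq_len(const unsigned char *p, size_t avail)
{
    unsigned char c = p[0];
    unsigned char lo = 0x80, hi = 0xBF;  /* allowed range of the 2nd byte */
    size_t n, i;

    if (c < 0x80)
        return 1;                        /* ASCII */
    if (c >= 0xC2 && c <= 0xDF) {
        n = 2;                           /* 0xC0 and 0xC1 would be overlong */
    } else if (c >= 0xE0 && c <= 0xEF) {
        n = 3;
        if (c == 0xE0) lo = 0xA0;        /* reject overlong encodings */
        if (c == 0xED) hi = 0x9F;        /* reject UTF-16 surrogates */
    } else if (c >= 0xF0 && c <= 0xF4) {
        n = 4;
        if (c == 0xF0) lo = 0x90;        /* reject overlong encodings */
        if (c == 0xF4) hi = 0x8F;        /* reject values above U+10FFFF */
    } else {
        return 0;                        /* stray continuation or bad lead */
    }

    if (avail < n)
        return 0;                        /* sequence truncated by the buffer */
    if (p[1] < lo || p[1] > hi)
        return 0;
    for (i = 2; i < n; i++) {
        if ((p[i] & 0xC0) != 0x80)
            return 0;
    }
    return n;
}

/*
 * Hypothetical helper: copy `len` bytes from `in` into a newly malloc()ed,
 * NUL-terminated buffer, replacing every byte that does not start a
 * well-formed UTF-8 sequence with U+FFFD.  Returns NULL if allocation fails.
 */
char *
sanitize_utf8(const unsigned char *in, size_t len)
{
    /* Worst case: every input byte expands to a 3-byte U+FFFD. */
    char *out = malloc(len * 3 + 1);
    size_t i = 0, o = 0;

    if (out == NULL)
        return NULL;
    while (i < len) {
        size_t n = utf8_seq_len(in + i, len - i);

        if (n == 0) {
            out[o++] = (char)0xEF;
            out[o++] = (char)0xBF;
            out[o++] = (char)0xBD;
            i++;                         /* skip only the offending byte */
        } else {
            while (n-- > 0)
                out[o++] = (char)in[i++];
        }
    }
    out[o] = '\0';
    return out;
}
```

With this sketch the three bytes 'a', 0xFF, 'z' come back as 'a', EF BF BD, 'z', and valid input is copied through unchanged. The extra pass over every string is exactly the per-byte cost that the performance concern is about.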

I seem to remember some discussion of this and some concern that doing the
validation would slow down dissection significantly.  If so, perhaps what needs
to be done is to have the value of an FT_STRING field be a combination of an
ENC_ value and a raw blob of bytes copied directly from the packet, with the
blob converted to valid UTF-8 when necessary - with that conversion, for
ENC_UTF_8, getting rid of invalid UTF-8 sequences.
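
To make that idea a bit more concrete, here is a rough standalone sketch of such a value. Everything in it is hypothetical and not an existing Wireshark type or API: the lazy_string struct, the to_utf8_fn callback, and the plain int standing in for an ENC_ value. The sample converter treats the bytes as 7-bit ASCII and maps anything else to U+FFFD; an ENC_UTF_8 converter would validate the sequences instead.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical converter type: turn raw bytes in encoding `enc` into a
 * newly malloc()ed, NUL-terminated, valid UTF-8 string. */
typedef char *(*to_utf8_fn)(int enc, const unsigned char *raw, size_t len);

/* Hypothetical field value: an encoding tag plus the untouched packet bytes. */
typedef struct {
    int            enc;          /* stands in for an ENC_ value */
    unsigned char *raw;          /* bytes copied verbatim from the packet */
    size_t         len;
    char          *utf8;         /* cached conversion, NULL until requested */
    unsigned       conversions;  /* how many times we actually converted */
} lazy_string;

/* Copy the raw bytes; no validation or conversion happens here. */
lazy_string *
lazy_string_new(int enc, const unsigned char *bytes, size_t len)
{
    lazy_string *s = malloc(sizeof *s);

    if (s == NULL)
        return NULL;
    s->raw = malloc(len ? len : 1);
    if (s->raw == NULL) {
        free(s);
        return NULL;
    }
    if (len > 0)
        memcpy(s->raw, bytes, len);
    s->enc = enc;
    s->len = len;
    s->utf8 = NULL;
    s->conversions = 0;
    return s;
}

/* Convert on first use only; later calls return the cached string. */
const char *
lazy_string_get(lazy_string *s, to_utf8_fn convert)
{
    if (s->utf8 == NULL) {
        s->utf8 = convert(s->enc, s->raw, s->len);
        if (s->utf8 != NULL)
            s->conversions++;
    }
    return s->utf8;
}

void
lazy_string_free(lazy_string *s)
{
    if (s != NULL) {
        free(s->utf8);
        free(s->raw);
        free(s);
    }
}

/* Sample converter: bytes below 0x80 pass through, anything else becomes
 * U+FFFD (EF BF BD).  The enc argument is ignored in this sketch. */
char *
ascii_to_utf8(int enc, const unsigned char *raw, size_t len)
{
    char *out = malloc(len * 3 + 1);
    size_t i, o = 0;

    (void)enc;
    if (out == NULL)
        return NULL;
    for (i = 0; i < len; i++) {
        if (raw[i] < 0x80) {
            out[o++] = (char)raw[i];
        } else {
            out[o++] = (char)0xEF;
            out[o++] = (char)0xBF;
            out[o++] = (char)0xBD;
        }
    }
    out[o] = '\0';
    return out;
}
```

In this arrangement, a field that is never displayed or filtered on never pays for the conversion, and a field that is used repeatedly pays for it once.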

