Wireshark · Wireshark-dev: Re: [Wireshark-dev] tvb_get_string_enc() doesn't always return valid UTF-8

Date: Mon, 20 Jan 2014 15:05:42 -0500

On Mon, Jan 20, 2014 at 2:52 PM, Jakub Zawadzki
<darkjames-ws@xxxxxxxxxxxx> wrote:
> Hi,
>
> On Mon, Jan 20, 2014 at 06:22:37PM +0100, Martin Kaiser wrote:
>> if I have a tvbuff that starts with 0x86 and I call
>>
>> a = tvb_get_string_enc(tvb, 0, ENC_ASCII)
>> proto_tree_add_string(..., a);
>>
>> I can trigger the DISSECTOR_ASSERT, since a is not a valid UTF-8 string.
>>
>> Comments in the code suggest that tvb_get_string() should replace
>> chars >= 0x80 with the Unicode replacement character (U+FFFD), which
>> is three bytes long in UTF-8. This would look like
>> [...]
>>
>> The resulting string would still contain len+1 chars but not necessarily
>> len+1 bytes. Would that be a problem, i.e. is it OK to do something like
>>
>> b = tvb_get_string(NULL, tvb, offset, len_b);
>> copy_of_b = g_malloc(len_b+1);
>> memcpy(copy_of_b, b, len_b+1);
>
> If you just want to duplicate a string, you should definitely use g_strdup() ;-)

As long as you can guarantee there won't be embedded NULs, since
g_strdup() stops copying at the first one.
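
To make the embedded-NUL caveat concrete, here is a minimal sketch in
plain C. Both helper names are hypothetical: dup_c_string() is only a
stand-in for what a strdup-style copy does, and dup_bytes() is a
length-aware alternative. Neither is GLib or Wireshark code.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a strdup-style copy: stops at the first NUL. */
static char *dup_c_string(const char *s)
{
    assert(s != NULL);
    size_t n = strlen(s) + 1;          /* bytes up to and including the NUL */
    char *copy = malloc(n);
    if (copy != NULL)
        memcpy(copy, s, n);
    return copy;
}

/* Hypothetical length-aware copy: keeps all len bytes, then terminates. */
static char *dup_bytes(const char *s, size_t len)
{
    assert(s != NULL || len == 0);
    char *copy = malloc(len + 1);
    if (copy != NULL) {
        if (len > 0)
            memcpy(copy, s, len);      /* embedded NULs are copied too */
        copy[len] = '\0';
    }
    return copy;
}
```

Given the five bytes 'a', 'b', NUL, 'c', 'd', the string-style copy keeps
only "ab", while the length-aware copy keeps all five bytes.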

>> If that should work, we'd need a separate function to get the string
>> and replace 8-bit chars.
>
> I don't think we need one: tvb_get_string_enc(..., ENC_ASCII) should return
> a valid UTF-8 string, and all callers assuming it's just a 1:1 copy are buggy.
>
> Maybe we should add ENC_STRING_DONT_CONVERT, if people just want to
> have a NUL-terminated string?
>
>
> By the way, I really wonder whether the current way of using a replacement
> character is a good one. Maybe we should escape it as something like \x86.
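
For illustration, the two strategies discussed above (substituting the
replacement character versus escaping the byte) might look roughly like
the sketch below. Both function names are hypothetical and this is not
the Wireshark implementation; it only assumes that "ASCII" input means
every byte >= 0x80 is invalid. U+FFFD occupies three bytes (EF BF BD) in
UTF-8, so the output can be longer than the input in either case.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical: replace each byte >= 0x80 with U+FFFD (EF BF BD). */
static char *ascii_to_utf8_replace(const unsigned char *in, size_t len)
{
    assert(in != NULL || len == 0);
    /* Worst case: every input byte turns into three output bytes. */
    char *out = malloc(len * 3 + 1);
    size_t o = 0;
    if (out == NULL)
        return NULL;
    for (size_t i = 0; i < len; i++) {
        if (in[i] < 0x80) {
            out[o++] = (char)in[i];
        } else {
            memcpy(out + o, "\xEF\xBF\xBD", 3);
            o += 3;
        }
    }
    out[o] = '\0';
    return out;
}

/* Hypothetical: escape each byte >= 0x80 as the four characters \xNN. */
static char *ascii_escape_high(const unsigned char *in, size_t len)
{
    static const char hex[] = "0123456789abcdef";
    assert(in != NULL || len == 0);
    /* Worst case: every input byte turns into four output bytes. */
    char *out = malloc(len * 4 + 1);
    size_t o = 0;
    if (out == NULL)
        return NULL;
    for (size_t i = 0; i < len; i++) {
        if (in[i] < 0x80) {
            out[o++] = (char)in[i];
        } else {
            out[o++] = '\\';
            out[o++] = 'x';
            out[o++] = hex[in[i] >> 4];
            out[o++] = hex[in[i] & 0x0F];
        }
    }
    out[o] = '\0';
    return out;
}
```

Either way, a caller that copies the result has to size the copy from
strlen() of the returned string plus one, not from the input length plus
one. The escape variant as sketched does not escape a literal backslash
in the input, so its output is not unambiguously reversible.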
