Wireshark-dev: Re: [Wireshark-dev] [Wireshark-commits] rev 53819: /trunk/epan/ /trunk/epan/diss

From: Guy Harris <guy@xxxxxxxxxxxx>
Date: Sat, 7 Dec 2013 14:42:16 -0800
On Dec 7, 2013, at 2:10 AM, darkjames@xxxxxxxxxxxxx wrote:

> http://anonsvn.wireshark.org/viewvc/viewvc.cgi?view=rev&revision=53819
> 
> User: darkjames
> Date: 2013/12/07 10:10 AM
> 
> Log:
> Add new string proto encoding for windows-1250 (ENC_WINDOWS_1250)
> 
> - Move windows-1250 to unicode encoding table to charset.c
> - Add tvb_get_string_unichar2, tvb_get_stringz_unichar2 functions which recode tvb-string to UTF-8.

Note that

	https://developer.gnome.org/glib/stable/glib-Unicode-Manipulation.html#gunichar2

says of a gunichar2 that it is

	A type which can hold any UTF-16 code point[4].

with the footnote:

	https://developer.gnome.org/glib/stable/glib-Unicode-Manipulation.html#ftn.utf16_surrogate_pairs

saying

	[4] surrogate pairs

This means that a gunichar2 can hold either

	1) a character from the Basic Multilingual Plane (BMP) of Unicode:

		https://en.wikipedia.org/wiki/Basic_Multilingual_Plane#Basic_Multilingual_Plane

or

	2) one half of a surrogate pair (a single surrogate code unit, not the whole pair):

		https://en.wikipedia.org/wiki/UTF-16#Code_points_U.2B10000_to_U.2B10FFFF

so those routines can handle only encodings that don't include characters outside the BMP.
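
To make the limitation concrete, here is a minimal sketch in C of how a Unicode code point turns into UTF-16 code units. It is not Wireshark or GLib code: utf16_encode is a hypothetical helper written for this illustration, and uint16_t stands in for gunichar2. A BMP character fits in one 16-bit unit, while anything above U+FFFF needs two, so it cannot be stored in a single gunichar2.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of Wireshark or GLib): encode one
 * Unicode code point as UTF-16. uint16_t stands in for gunichar2.
 * Returns the number of 16-bit units written to out (1 or 2),
 * or 0 if cp is not a valid Unicode scalar value.
 */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;                       /* out of range, or a lone surrogate */

    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;          /* BMP: one unit is enough */
        return 1;
    }

    cp -= 0x10000;                      /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high (lead) surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low (trail) surrogate */
    return 2;
}
```

For example, U+00E9 comes back as the single unit 0x00E9, while U+1F600 comes back as the two units 0xD83D 0xDE00; only the first case fits in one gunichar2.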

This is probably true of most non-Unicode encodings, such as the ISO 8859-n encodings, so those routines are OK for such encodings, but be careful about which encodings you use them with.
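
As a rough illustration of why a single-byte encoding such as windows-1250 is fine here, the following C sketch recodes bytes to UTF-8 through a table of 16-bit values. It is an assumption about the general technique, not the code from the commit: recode_to_utf8, bmp_to_utf8, and the 128-entry table layout are hypothetical names and choices, and unichar2 stands in for gunichar2. Every table entry is one BMP character, so at most three UTF-8 bytes are produced per input byte.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for GLib's gunichar2: one 16-bit UTF-16 code unit. */
typedef uint16_t unichar2;

/*
 * Hypothetical helper: write one BMP character as UTF-8.
 * A surrogate value cannot stand alone, so it is replaced with U+FFFD.
 * Returns the number of bytes written (1 to 3).
 */
size_t bmp_to_utf8(unichar2 u, unsigned char *out)
{
    if (u >= 0xD800 && u <= 0xDFFF)
        u = 0xFFFD;
    if (u < 0x80) {
        out[0] = (unsigned char)u;
        return 1;
    }
    if (u < 0x800) {
        out[0] = (unsigned char)(0xC0 | (u >> 6));
        out[1] = (unsigned char)(0x80 | (u & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (u >> 12));
    out[1] = (unsigned char)(0x80 | ((u >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (u & 0x3F));
    return 3;
}

/*
 * Hypothetical helper: recode len bytes of a single-byte encoding.
 * Bytes below 0x80 are taken as ASCII; table_hi maps 0x80..0xFF.
 * dst must have room for 3 * len + 1 bytes. Returns the number of
 * bytes written, not counting the terminating NUL.
 */
size_t recode_to_utf8(const uint8_t *src, size_t len,
                      const unichar2 table_hi[128], unsigned char *dst)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        unichar2 u = (src[i] < 0x80) ? src[i] : table_hi[src[i] - 0x80];
        n += bmp_to_utf8(u, dst + n);
    }
    dst[n] = '\0';
    return n;
}
```

An encoding with characters outside the BMP would need a wider table entry type (or two entries per character), which is the case the warning above is about.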