I am using "tshark -T json -V -r file.pcap" and specifically I am looking for the gsm_sms.sms_text field.
I get this output:
"gsm_sms.sms_text": "Ok per\u00c3\u00b2 non piove"
Instead, using "tshark -V -r file.pcap" I get:
SMS text: Ok però non piove
(There is a grave accent on the "o" of "però".)
The problem is that the \uXXXX escape syntax denotes UTF-16 code units (see [1]), whereas "ò" is encoded here as UTF-8, with the bytes c3 b2. Wireshark escapes each of those bytes as if it were a separate UTF-16 code unit, so a JSON parser decodes the field as "Ã²" instead of "ò".
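To make the mismatch concrete, here is a minimal, self-contained C illustration (not Wireshark code; the array name is made up) of why per-byte \u00XX escaping of the bytes c3 b2 yields "Ã²" rather than "ò":

    #include <stdio.h>

    int main(void)
    {
        const unsigned char o_grave_utf8[] = { 0xc3, 0xb2 };  /* UTF-8 encoding of "ò" (U+00F2) */

        /* Per-byte escaping, as in the current JSON output: prints \u00c3\u00b2,
         * which a JSON parser decodes as U+00C3 U+00B2, i.e. "Ã²". */
        for (size_t i = 0; i < sizeof o_grave_utf8; i++)
            printf("\\u00%02x", o_grave_utf8[i]);
        putchar('\n');

        /* Two valid ways to put "ò" in a JSON string: the raw UTF-8 bytes,
         * or a single escape for the code point U+00F2. */
        fwrite(o_grave_utf8, 1, sizeof o_grave_utf8, stdout);
        printf("\n\\u00f2\n");
        return 0;
    }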
I solved the problem by changing print_escaped_bare() in epan/print.c as follows:
substitute

    default:
        if (g_ascii_isprint(*p))
            fputc(*p, fh);
        else {
            g_snprintf(temp_str, sizeof(temp_str), "\\u00%02x", (guint8)*p);
            fputs(temp_str, fh);
        }

with

    default:
        fputc(*p, fh);
I do not know the Wireshark code base, so I am not submitting a patch. This change, however, should work, because JSON supports UTF-8 (see again [1]).
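One caveat: simply doing fputc(*p, fh) in the default case also passes raw control characters through, which JSON does not allow inside strings. A more conservative sketch (untested, and assuming the enclosing switch in print_escaped_bare() already handles quotes, backslashes and the usual C escapes) would keep the \u escape only for ASCII control bytes and pass every other byte through, so that multi-byte UTF-8 sequences survive intact:

    default:
        if ((guint8)*p < 0x20) {
            /* Control characters must still be escaped to keep the JSON valid. */
            g_snprintf(temp_str, sizeof(temp_str), "\\u%04x", (guint8)*p);
            fputs(temp_str, fh);
        } else {
            /* Printable ASCII and raw UTF-8 bytes are both legal in a JSON string. */
            fputc(*p, fh);
        }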
[1] From the JSON article on Wikipedia: "JSON exchange in an open ecosystem must be encoded in UTF-8. However, if escaped, those characters must be written using UTF-16 surrogate pairs, a detail missed by some JSON parsers."