Wireshark-bugs: [Wireshark-bugs] [Bug 1827] problem with accentuated letters

Date: Thu, 17 Sep 2009 19:55:32 -0700 (PDT)
https://bugs.wireshark.org/bugzilla/show_bug.cgi?id=1827





--- Comment #18 from Lee Carré <leec100@xxxxxxxxxxx>  2009-09-17 19:55:27 PDT ---
(In reply to comment #16)
> This is looking like another one of those issues that is caused by us not using unicode or character set specific strings to hold everything in :(.

Character encoding is a tricky subject.
Just as you need to know the format in which image data was saved before you
can read an image file, you need to know which encoding produced a text string
before you can interpret it reliably. Even within Unicode itself there are
several quite different encoding schemes, each with different goals and
trade-offs.
An unspecified encoding forces the reader to fall back on unreliable
assumptions.
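
To illustrate the point about assumptions, here is a minimal Python sketch (not
Wireshark code; the variable names are made up for this example) that reads
the same bytes under two different assumed encodings, and shows that a wrong
guess either yields garbage or fails outright:

```python
# The two bytes below are the UTF-8 encoding of "é" (U+00E9).
raw = b"\xc3\xa9"

as_utf8 = raw.decode("utf-8")      # correct assumption: one character
as_latin1 = raw.decode("latin-1")  # wrong assumption: two unrelated characters

# The reverse mistake: ISO-8859-1 bytes read as UTF-8 are not even valid.
latin1_bytes = "\u00e9".encode("latin-1")  # single byte 0xE9
try:
    latin1_bytes.decode("utf-8")
    valid_as_utf8 = True
except UnicodeDecodeError:
    valid_as_utf8 = False

print(repr(as_utf8), repr(as_latin1), valid_as_utf8)
```

Nothing in the bytes themselves says which reading is the intended one; that
information has to come from outside the string.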

The encoding must always be specified (even when using Unicode, there are many
different ways of encoding abstract characters into binary data). Although it
may be possible to auto-detect which (Unicode) encoding was used, that just
adds complexity (read: bug-fertiliser), and can be avoided anyway.
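
As a small illustration of "many different ways" within Unicode itself, the
following Python sketch encodes the single character U+00E9 using several of
the standard Unicode encoding forms. The variable names are only for this
example:

```python
ch = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

# One abstract character, several different byte sequences.
utf8 = ch.encode("utf-8")
utf16_le = ch.encode("utf-16-le")
utf16_be = ch.encode("utf-16-be")
utf32_le = ch.encode("utf-32-le")

for name, data in [("UTF-8", utf8), ("UTF-16LE", utf16_le),
                   ("UTF-16BE", utf16_be), ("UTF-32LE", utf32_le)]:
    print(name, data.hex())
```

A reader who is told only "this is Unicode" still cannot tell which of these
byte sequences to expect.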

Sadly, character encoding really needs to be addressed before implementation
begins. Once it has begun, fixing the code is a trickier problem: to support
Unicode correctly, the whole application needs to handle Unicode natively
throughout, from input to output (other encodings can easily be mapped to
Unicode on input, and back to whatever encoding is needed on output).
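
The "map to Unicode on input, map back on output" arrangement could look
roughly like the following Python sketch. The helper names decode_input and
encode_output are hypothetical, and the encodings are assumed to be known from
the protocol or file format rather than guessed:

```python
def decode_input(raw: bytes, encoding: str) -> str:
    """Convert external bytes to a Unicode string at the input boundary."""
    return raw.decode(encoding)


def encode_output(text: str, encoding: str) -> bytes:
    """Convert a Unicode string back to bytes at the output boundary."""
    return text.encode(encoding)


# Input arrives as ISO-8859-1; everything in between works on Unicode text.
text = decode_input(b"caf\xe9", "latin-1")
shouted = text.upper()

# The output side is free to use a different encoding.
out = encode_output(shouted, "utf-8")
print(text, shouted, out)
```

The internal processing step never needs to know which encoding the data
arrived in or will leave in.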
Unicode also imposes additional processing requirements for certain
characters, because characters with diacritics can be represented in several
different ways. In this case “é” can be represented in at least two ways:
either by the single pre-composed character “U+00E9 LATIN SMALL LETTER E WITH
ACUTE”, or by the base character “U+0065 LATIN SMALL LETTER E” followed
immediately by “U+0301 COMBINING ACUTE ACCENT”. Both sequences (1) are intended
to represent the same logical character, and (2) are to be treated as
equivalent to each other. Multiple methods exist in order to fulfil multiple
requirements simultaneously. The first is compatibility with older character
encodings. The second is to avoid any need for further pre-composed
characters. There are several dozen different diacritics (U+0300 to U+036F, et
al.); representing all possible combinations would take several hundred
pre-composed characters, possibly the low thousands, since there are many base
characters, dozens of diacritics, and some logical characters carry multiple
diacritics in certain languages. Combining characters provide the components,
so that authors can combine them in almost any way they need. This avoids any
further proliferation of, and pollution by, pre-composed characters.
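
The two representations of “é” can be compared directly. The sketch below uses
Python's standard unicodedata module; applying Unicode normalization (NFC or
NFD) before comparing is one common way to treat the two forms as equivalent,
and is offered here as an illustration rather than as what Wireshark does:

```python
import unicodedata

precomposed = "\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # U+0065 followed by U+0301 COMBINING ACUTE ACCENT

# A naive comparison sees two different strings of different lengths.
naive_equal = precomposed == decomposed
lengths = (len(precomposed), len(decomposed))

# Normalizing both sides to the same form makes them compare equal.
nfc_equal = (unicodedata.normalize("NFC", precomposed)
             == unicodedata.normalize("NFC", decomposed))
nfd_equal = (unicodedata.normalize("NFD", precomposed)
             == unicodedata.normalize("NFD", decomposed))

print(naive_equal, lengths, nfc_equal, nfd_equal)
```

Any code that searches, filters or compares strings has to pick one normalized
form first, otherwise a match can be missed even though the text looks
identical on screen.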

