Wireshark-bugs: [Wireshark-bugs] [Bug 1827] problem with accentuated letters
Date: Thu, 17 Sep 2009 19:55:32 -0700 (PDT)
https://bugs.wireshark.org/bugzilla/show_bug.cgi?id=1827

--- Comment #18 from Lee Carré <leec100@xxxxxxxxxxx> 2009-09-17 19:55:27 PDT ---
(In reply to comment #16)
> This is looking like another one of those issues that is caused by us not using unicode or character set specific strings to hold everything in :(.

Character encoding is a tricky subject. Much like when attempting to read a (graphical) image file, where you need to know in which format the image data is saved, to correctly and reliably interpret a text string you need to know which encoding was used to produce it. Even within Unicode itself, there are several quite different encoding schemes, each with different goals and pros/cons.

An unspecified encoding forces unreliable assumptions. The encoding must always be specified (even when using Unicode, there are many different ways of encoding abstract characters into binary data). Although it may be possible to auto-detect which (Unicode) encoding was used, that just adds complexity (read: bug-fertiliser), and can be avoided anyway.

Sadly, character encoding issues really need to be addressed before implementation begins. Once it has begun, fixing the code is a trickier problem, because to support Unicode correctly the whole application needs to support Unicode natively throughout, from input to output (other encodings can easily be mapped to Unicode on input, and back to whatever encoding on output, if needed).

Unicode also presents additional processing requirements when it comes to certain characters, as characters for which ‘pre-composed’ forms exist (such as those with diacritics) can be represented in several different ways. In this case “é” can be represented in at least two ways: firstly by the single “U+00E9 LATIN SMALL LETTER E WITH ACUTE” character, alternatively by the base “U+0065 LATIN SMALL LETTER E” character followed immediately by the “U+0301 COMBINING ACUTE ACCENT” character.
Both representations are (1) intended to represent the same logical character and (2) to be treated as equivalent to each other.

Multiple methods exist in order to fulfil several requirements simultaneously. The first is compatibility with older character encodings. The second is to avoid any need for (further) pre-composed characters. There are several dozen different diacritics (U+0300 to U+036F, et al.); representing all possible combinations would require several hundred pre-composed characters, possibly the low thousands, since there are many base characters, dozens of diacritics, and some logical characters carry multiple diacritics in certain languages.

Combining Characters are a way to provide the components, so that authors can combine them in almost any way they need. This avoids any further proliferation of, and pollution by, pre-composed characters.
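
As a rough illustration of the two points above (this is not Wireshark code; Python is used only because its standard library exposes Unicode normalisation directly, and the helper name `canonically_equal` is made up for this sketch), the following shows that the two spellings of “é” differ as raw code point sequences but compare equal once both are normalised, and that the same bytes yield different text when the wrong encoding is assumed:

```python
import unicodedata

# The two representations of "é" discussed above.
precomposed = "\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # U+0065 followed by U+0301 COMBINING ACUTE ACCENT


def canonically_equal(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after normalising both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# A naive comparison sees two different code point sequences.
raw_equal = precomposed == decomposed                            # False
# After normalisation they are treated as the same logical character.
normalised_equal = canonically_equal(precomposed, decomposed)    # True

# The encoding has to be known: the UTF-8 bytes for "é" ...
utf8_bytes = precomposed.encode("utf-8")                         # b'\xc3\xa9'
# ... read back as ISO 8859-1 (Latin-1) produce two unrelated characters.
misread = utf8_bytes.decode("latin-1")                           # 'Ã©'

if __name__ == "__main__":
    print(raw_equal, normalised_equal)
    print(utf8_bytes, repr(misread))
```

An application that normalises all text to a single form on input (NFC here, though NFD would serve equally well for comparison) can then use ordinary string comparison everywhere else, which fits the earlier point about mapping to one internal representation on input and back on output.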