[Wireshark-bugs] [Bug 8382] MS-MMS dissector crash
Date: Sun, 03 Mar 2013 02:22:36 +0000
Comment #30 on bug 8382 from Guy Harris
(In reply to comment #21)

> - tvb_get_string_* functions should preserve embedded nulls, as they should
>   be just dumb getters from the tvb

If so, then the string might not be usable by the dissector in, for example, comparisons, as the string won't be guaranteed to be in a C-string-style representation (a string of octets, with all ASCII characters represented as single octets containing the ASCII value of the character).

If the dissector is just adding a string value, without needing to look at the value, it should use proto_tree_add_item(). If it needs to look at the value, it should use a tvb_get_string_* routine that takes an encoding argument, and either:

1) the routine should return a C string (null-terminated, meaning we can't handle embedded nulls), or

2) the routine should return something such as a counted string, so that we can handle embedded nulls, and we should provide routines that handle that type of string;

and the string's encoding should be UTF-8.

> - tvb_get_string_* functions should provide the length of the returned
>   string, since with multi-byte characters and embedded nulls, the caller has
>   no other way of determining this value

"Length" in octets or "length" in characters? That sounds like case 2), where we use counted strings.

> - the ms-mms dissector should be checking for overflow when doubling a value
>   it gets from the packet
>
> - the ms-mms dissector should be using proto_tree_add_unicode_string for
>   strings it gets via tvb_get_ephemeral_unicode_string
>
> - it should be possible to add multibyte unicode strings (UCS-2, UTF-16,
>   UCS-4, etc) to the tree with proto_tree_add_item and the appropriate
>   encoding arguments

Yes. This raises the question of how the value of a string should be stored in the protocol tree. Two options I see are:

1) store it as some Unicode encoding (UTF-8, UTF-16, or UCS-4);

2) store it as a raw collection of octets, along with the encoding used, and convert it to some Unicode encoding when needed.
"When needed" would include:

- when presenting it in human-readable form;
- when presenting it in PDML or some other form intended for machine reading;
- when providing it to code for use in, for example, a tap;
- when comparing with it in a display filter.

If there's anything in the string that *can't* be converted (e.g., an invalid UTF-8 sequence, or a character in some non-Unicode set that has no Unicode equivalent, if any such characters exist), then:

- when presenting it in human-readable form, we'd have to convert it to something to show to the user (Unicode REPLACEMENT CHARACTER?);
- when presenting it in PDML, we'd have to do something;
- when providing it for code to use, we'd have to do something (REPLACEMENT CHARACTER plus a "sorry, I couldn't convert this completely" indication?);
- when comparing it in a display filter, we'd cause all comparisons other than "not equal" to fail (unless we have a way to specify comparison against arbitrary octet strings).

> - calling format_str is necessary when passing a packet-derived string into
>   proto_tree_add_string, col_append_fstr and friends. These functions could
>   potentially call format_str themselves on every string they get, but the
>   performance hit would be big. Faster to continue calling it only where
>   needed.

When presenting strings in human-readable form, we need to figure out what to do about non-printable characters. The format_str mechanism of showing them as C string escapes might work, although would we then have to escape all backslashes in the string? That'd look Really Weird in the SMB dissector, for example, with pathnames filled up with \\. Perhaps string fields should have a display option indicating whether to use C string escapes (complete with escaping of backslashes) or to just show non-printable characters either as REPLACEMENT CHARACTER or, at least for C0 control characters, as the SYMBOL FOR xxx characters (0x2400+{value} for 0x00 through 0x1F, and 0x2421 for 0x7F).
It might also be useful to have some way in the UI to look at the raw bytes corresponding to individual glyphs, in case, for example, you're trying to debug a problem with a client that's sending bad UTF-8 sequences over the wire, or a problem with decomposed vs. composed accented characters.

My personal inclination is to store string values as encoding+raw bytes, doing the conversion to UTF-8 lazily, and:

- have protocol.string {comparison} "hello" be done either by converting protocol.string to UTF-8 and comparing the UTF-8 strings, or by converting "hello" to the encoding, caching the result, and failing if the conversion fails (>, >=, <, and <= might be done by converting to UTF-8; they're probably less common, and they raise questions about what order to impose on strings in any case). The latter would probably involve fewer conversions.

- have protocol.string {==,!=} xx:xx:xx:xx:xx:xx:xx:xx... be done by comparing the sequence of octets with the raw bytes.

- have the "apply as filter" and "prepare as filter" stuff, when you try to construct a filter from a string field, use the sequence-of-octets form if the value can't be converted to Unicode and the string form otherwise (although that would get tricky if the Unicode string contains non-printable characters).