On Apr 4, 2014, at 2:01 PM, Hadriel Kaplan <hadriel.kaplan@xxxxxxxxxx> wrote:
> For protocols which are actually truly UTF-8, I'm planning to just assume treating them as ASCII is ok, because as far as I know the atoi/strtol/etc. functions don't actually care: if they see the ASCII characters for digits (and +/-/etc.) they'll parse it, else not. So any non-ASCII UTF-8 character in the sequence is meaningless to them and they stop parsing at that character.
Yes, the only valid octets in a number in any "extended ASCII" would be:
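0x2b ('+'), 0x2d ('-'), 0x30, 0x31, 0x32, 0x33, 0x34, 0x35, 0x36, and 0x37 ('0' through '7') for radix 8, 10, or 16;
0x38 and 0x39 ('8' and '9') if the radix is 10 or 16;
0x41, 0x42, 0x43, 0x44, 0x45, 0x46 ('A' through 'F'), 0x61, 0x62, 0x63, 0x64, 0x65, and 0x66 ('a' through 'f') if the radix is 16;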
so anything with the 8th bit set is not valid, meaning that the same routine can handle ASCII, ISO 8859-n, various Windows code pages, various Mac code pages, and UTF-8 - the actual character encoding is irrelevant, as long as ASCII characters are encoded as a single octet having the ASCII code point value.
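
As a rough illustration of the octet list above, here is a minimal C sketch. The names is_number_octet() and number_prefix_len() are hypothetical, made up for this example and not taken from any existing code; the sketch assumes only radix 8, 10, and 16 matter and does not check that a sign appears only at the start.

```c
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical helper: return 1 if octet c may appear in a number of the
 * given radix (8, 10, or 16), 0 otherwise.  Any other radix is rejected.
 * Position is not checked, so a sign is accepted anywhere.
 */
static int is_number_octet(unsigned char c, int radix)
{
    if (radix != 8 && radix != 10 && radix != 16)
        return 0;
    if (c == 0x2b || c == 0x2d)                 /* '+' and '-' */
        return 1;
    if (c >= 0x30 && c <= 0x37)                 /* '0' through '7' */
        return 1;
    if (c == 0x38 || c == 0x39)                 /* '8' and '9' */
        return radix == 10 || radix == 16;
    if ((c >= 0x41 && c <= 0x46) ||             /* 'A' through 'F' */
        (c >= 0x61 && c <= 0x66))               /* 'a' through 'f' */
        return radix == 16;
    return 0;   /* everything else, including any octet with the 8th bit set */
}

/*
 * Hypothetical helper: length of the leading run of number octets in buf.
 * The scan stops at the first octet that is not in the list, which is
 * where any non-ASCII character starts, whatever the encoding.
 */
static size_t number_prefix_len(const unsigned char *buf, size_t len, int radix)
{
    size_t i = 0;

    while (i < len && is_number_octet(buf[i], radix))
        i++;
    return i;
}
```

For example, given the four octets 0x34 0x32 0xc3 0xa9 ("42" followed by the UTF-8 encoding of e-acute), number_prefix_len() with radix 10 stops after the first two octets, and strtol() on the same string likewise stops at the 0xc3 octet.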