UTF-8

Richard Emberson (emberson@faslab.com)
Fri, 16 Oct 1998 15:48:38 -0700


Does the UTF-8 encoding require that the minimum byte count
be used when a character is encoded?
Recall that the form of a UTF-8 encoding is:

0xxxxxxx
110xxxxx 10xxxxxx
1110xxxx 10xxxxxx 10xxxxxx
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
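
As a minimal sketch of the table above (the function name
sequence_length is my own, purely for illustration), the lead byte
alone determines how many bytes the sequence claims to occupy:

```python
def sequence_length(lead_byte: int) -> int:
    """Return the total byte count implied by a lead byte.

    Follows the six patterns listed above. This is an illustrative
    helper, not part of any library.
    """
    if (lead_byte & 0b10000000) == 0b00000000:
        return 1  # 0xxxxxxx
    if (lead_byte & 0b11100000) == 0b11000000:
        return 2  # 110xxxxx
    if (lead_byte & 0b11110000) == 0b11100000:
        return 3  # 1110xxxx
    if (lead_byte & 0b11111000) == 0b11110000:
        return 4  # 11110xxx
    if (lead_byte & 0b11111100) == 0b11111000:
        return 5  # 111110xx
    if (lead_byte & 0b11111110) == 0b11111100:
        return 6  # 1111110x
    # 10xxxxxx is a continuation byte; 0xFE and 0xFF match no pattern.
    raise ValueError(f"not a valid lead byte: {lead_byte:#04x}")
```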

So one could, for example, claim that:

00111111

and

11000000 10111111

represent the same character, #x3F, or

11110001 10111111 10111111 10111111

and

11111000 10000001 10111111 10111111 10111111

represent #x7FFFF (note: #x10000 < #x7FFFF < #x10FFFF, and so it is legal).
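
To make the examples concrete, here is a rough sketch of a decoder
that simply strips the marker bits and concatenates the x bits. The
name naive_decode is hypothetical, and the function deliberately
does NOT check whether the shortest form was used, which is exactly
the behaviour in question:

```python
def naive_decode(bits: str) -> int:
    """Decode one UTF-8 style sequence given as space-separated bit strings.

    Illustrative only: accepts 1 to 6 byte forms and performs no
    shortest-form check.
    """
    octets = [int(b, 2) for b in bits.split()]
    lead = octets[0]

    # Count the leading 1 bits of the lead byte.
    ones = 0
    mask = 0x80
    while lead & mask:
        ones += 1
        mask >>= 1

    # Zero leading ones means a single byte; one means a stray 10xxxxxx.
    length = 1 if ones == 0 else ones
    if ones == 1 or ones > 6 or length != len(octets):
        raise ValueError("malformed sequence")

    # The payload bits of the lead byte are those below the first 0 bit.
    value = lead & (mask - 1)
    for octet in octets[1:]:
        if (octet & 0xC0) != 0x80:
            raise ValueError("expected a 10xxxxxx continuation byte")
        value = (value << 6) | (octet & 0x3F)
    return value


# The four sequences from the examples above.
print(hex(naive_decode("00111111")))
print(hex(naive_decode("11000000 10111111")))
print(hex(naive_decode("11110001 10111111 10111111 10111111")))
print(hex(naive_decode("11111000 10000001 10111111 10111111 10111111")))
```

With this sketch the first two calls yield the same value, as do the
last two, which is the ambiguity I am asking about.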

The reason I ask is that I want to know whether an XML parser has to
worry about 5- and 6-byte UTF-8 encodings, or whether it can *always*
assume that the values represented by such encodings are not legal
Unicode characters.

Thanks.

Richard Emberson