> To extend the available characters in Unicode one
> can use to 16 bit characters with surrogate blocks.
If you want to extend the available characters with your own, use the
"Private-Use" or "user-defined" character block. The surrogates are
codepoints reserved for messing up software later as more registered
national character sets are added. They are not for private use, and
implementors of current systems can ignore them at least for the next year,
as far as I know.
FIRST check that your character could not be represented by using an
existing ISO 10646 character with some appropriate attribute on the element.
In particular, if it is a regional variant of a character, try to use the
xml:lang attribute. Note that a "language" includes far more than just a
simple regional language: I could have xml:lang='en-US-legal' to indicate US
legalese, or xml:lang='x-physics' to indicate the language of physics (the
"x-" prefix marks a language that has not been registered with IANA). In
that case, your stylesheet can say "Oh, this is an X, but an X to be
rendered as physicists will want it rendered."
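
As a rough sketch of the xml:lang idea (the element names "doc", "term" and
"var" are made up for illustration; the language values are the two
mentioned above):

```xml
<?xml version="1.0"?>
<doc>
  <!-- an ordinary English word, flagged as US legalese -->
  <term xml:lang="en-US-legal">consideration</term>
  <!-- an existing ISO 10646 letter; the stylesheet decides how physicists want it drawn -->
  <var xml:lang="x-physics">h</var>
</doc>
```

A stylesheet could then key on the xml:lang value to pick a variant
rendering, with no new character needed.
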
NEXT, if you need mathematical characters, check out MathML (MML)
http://www.w3.org/TR/REC-MathML/chapter6.html
first.
FINALLY there are two contradictory needs for a user-defined character:
searching (collation) and display. Which fits you?--
If your primary need is DISPLAY, then it is better to use an entity
reference for the character. The corresponding entity contains an element
with a hypertext reference to the glyph of the character: e.g.
<!ENTITY my-alpha "<http:img src='url'/>">
If your system is smart, you could use content-negotiation to get the best
form: GIF or whatever. (And it lets you tie into some Web fonts system, as
that becomes available.) If you also need a little bit of collatability,
you could add an attribute to indicate collation sequence position.
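
A minimal sketch of the display approach with a collation hint, under some
assumptions: the element "glyph", its attributes "src" and "collate-as", and
the example.com URL are all hypothetical names chosen for illustration, not
part of any standard.

```xml
<?xml version="1.0"?>
<!DOCTYPE doc [
  <!ELEMENT doc (#PCDATA | glyph)*>
  <!ELEMENT glyph EMPTY>
  <!-- src points at the glyph image; collate-as is a hypothetical sort hint -->
  <!ATTLIST glyph
    src        CDATA #REQUIRED
    collate-as CDATA #IMPLIED>
  <!-- the entity expands to the element that fetches the glyph -->
  <!ENTITY my-alpha "<glyph src='http://www.example.com/glyphs/my-alpha' collate-as='a'/>">
]>
<doc>The angle &my-alpha; is small.</doc>
```

The URL deliberately has no file extension, so a server doing
content-negotiation would be free to return whichever image form suits the
client. An application that sorts text would have to be taught to read
collate-as; nothing does so automatically.
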
If your primary need is for simple SEARCHING (collation) rather than
presentation, then use the Private-Use area. (In the Private-Use characters,
avoid using E200-E600; MML uses them.) You should always enter any of the
Private-Use area characters using a numeric character reference (or, if you
use these characters more than once, or want to provide a modicum of
documentation, define an entity for them and use an entity reference)-- this
prevents possible transcoding errors later, and also makes the text more
readable in editors which do not allow private-use characters to be added.
(Western readers may be surprised that allowing user-defined characters is
not uncommon in CJK publishing software, since the standard sets only go so
far, even though it is almost unheard of in the West.)
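
The Private-Use advice above might look something like this. It is only a
sketch: the entity name "my-alpha" and the choice of U+E000 (the first
code point of the Private-Use area, and outside the E200-E600 range to be
avoided) are arbitrary assumptions.

```xml
<?xml version="1.0"?>
<!DOCTYPE doc [
  <!ELEMENT doc (#PCDATA)>
  <!-- my-alpha: user-defined alpha variant, assigned here to Private-Use U+E000 -->
  <!ENTITY my-alpha "&#xE000;">
]>
<doc>One-off use: &#xE000; Repeated use: &my-alpha; and &my-alpha;</doc>
```

After parsing, all three references yield the same single character, so a
plain text search will find every occurrence, while the source file itself
stays pure ASCII.
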
Rick Jelliffe
<kisses xml:lang='x-love'>XXX</kisses>