RE: CDATA by any other name... (was The raw and the cooked)

Rick Jelliffe (ricko@allette.com.au)
Sat, 31 Oct 1998 17:31:29 +1100


Henry Thompson wrote:

> The DOM made a serious mistake here in my opinion: it's
> stranded in no-person's-land between raw and cooked, without being
> either. It's not cooked, because it gives you EntityReference and
> CDATA nodes. It's not raw, because it DOESN'T give you character
> entity references.

CHARACTER REFERENCES
I think Henry means "numeric character reference", and this is the heart of
the matter. A numeric character reference is not an entity, any more than a
directly-entered character is. It is just an alternative encoding of the
character, and should be of no more interest to a general API than the
charset encoding of the document is. (I am putting words into his mouth: or
does Henry mean the [XML 4.6] predefined entities?)

Even if you declare
<!ENTITY example "&#123;">
the numeric character reference is not an entity: it is the value of an
entity with the name "example".
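
To make this concrete, here is a minimal sketch. It assumes Python's standard
xml.dom.minidom as the parser (my choice for illustration, not something from
this thread), and text_of is a hypothetical helper. By the time the
application sees the data, the directly-entered character, the numeric
character reference, and the entity whose value is that reference all look
the same:

```python
from xml.dom import minidom

def text_of(xml_source):
    """Concatenate the character data of the root element's children."""
    root = minidom.parseString(xml_source).documentElement
    return "".join(child.data for child in root.childNodes)

# The same character, entered three different ways.
literal = text_of('<a>{</a>')
numeric = text_of('<a>&#123;</a>')
via_entity = text_of(
    '<!DOCTYPE a [<!ENTITY example "&#123;">]><a>&example;</a>'
)

print(literal == numeric == via_entity)
```

Nothing in the resulting tree records which spelling the author used, which
is the sense in which the reference is only an alternative encoding.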

MARKED SECTIONS
On the subject of marked sections, I personally think that (in SGML) marked
sections should do more than just alter delimiter recognition: I think they
delimit anonymous inline entities, and label the entity with text-type
information. Underlying this is the idea that marked sections actually mark
up notations: at ISO there has been discussion of whether to allow something
like (for example)
<![JAVA[ java code here ]]>

This is not something that I would expect to make its way into XML (and I
think the ISO people are now more keen to help XML/WebSGML than on tidying
up SGML) but I think the idea that a marked section not only alters
delimiter recognition but also labels the data can be seen (in embryo or
residually) in the DOM's elevation of CDATASection to node-worthiness, which has
so perplexed Henry.
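
The node-worthiness is easy to observe. A minimal sketch, assuming Python's
xml.dom.minidom as a stand-in for a DOM implementation (other implementations
may offer an option to merge CDATA sections into ordinary text):

```python
from xml.dom import Node, minidom

doc = minidom.parseString('<a>raw<![CDATA[ <cooked?> ]]>tail</a>')
kinds = [child.nodeType for child in doc.documentElement.childNodes]

# The CDATA section arrives as its own node, flanked by ordinary Text nodes.
is_labelled = kinds == [
    Node.TEXT_NODE,
    Node.CDATA_SECTION_NODE,
    Node.TEXT_NODE,
]
print(is_labelled)
```

So the tree does carry a label on that stretch of data, beyond the change in
delimiter recognition that produced it.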

If you take the view that a CDATA section labels the data as character data
(i.e. not ignorable whitespace) then <![CDATA[ ]]> is clearly invalid in
Henry's example: because the " " is marked as data and data is not allowed
there. But that is ephemera: what does the spec say?

I think the answer is clear from the spec:
[43] content ::= (element | CharData | Reference | CDSect | PI | Comment)*
so a CDSect is not CharData. Therefore a CDSect is only valid in mixed
content, even though it is well-formed to have it in element content.
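
As a sketch of what this reading would mean for a checker (this is only my
interpretation expressed as code, not the behaviour of any particular
validating parser; element_content_violations is a hypothetical helper, and
it assumes the element passed in is declared with element content):

```python
from xml.dom import Node, minidom

XML_WHITE_SPACE = " \t\r\n"  # the characters of production [3] S

def element_content_violations(element):
    """Return the child nodes that are not allowed in element content."""
    bad = []
    for child in element.childNodes:
        if child.nodeType == Node.CDATA_SECTION_NODE:
            # A CDSect is not CharData, even if it holds only a space.
            bad.append(child)
        elif (child.nodeType == Node.TEXT_NODE
              and child.data.strip(XML_WHITE_SPACE)):
            # Non-blank character data is never allowed here.
            bad.append(child)
    return bad

plain = minidom.parseString('<list> <item/> </list>').documentElement
cdata = minidom.parseString(
    '<list><![CDATA[ ]]><item/> </list>'
).documentElement

print(len(element_content_violations(plain)),
      len(element_content_violations(cdata)))
```

Both documents are well-formed, so the parser accepts them; only the second
would be flagged, because of the CDATA section rather than because of the
space inside it.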

I think this is doubly clear from the discussion of "white-space" in [XML
2.10]: white-space for xml:space considerations (in element content) is
space added for "greater readability". <![CDATA[ ]]> does not do this!! It
disrupts readability. So from the purpose of valid whitespace in element
content it is clear that <![CDATA[ ]]> is not legitimate. The text is just as
important as the productions.

SPACES
Henry's problem brings up a further important consideration. XML gives an
attribute "xml:space" by which an application can know whether white space
may be collapsed or not. Can <![CDATA[ ]]> be used to override
xml:space=default? The answer is NO, because

* an application is free to decide whether to collapse spaces inside CDATA
marked sections or not;

* in PCDATA, ISO 10646 provides a specific character to indicate
non-collapsible whitespace: IDEOGRAPHIC SPACE &#x3000;

* outside mixed content <![CDATA[ ]]> is not valid for the reasons above.

XML, by adopting ISO 10646, takes the line that the only way to overcome the
problems that (ASCII) people have with spaces is to un-overload that damned
space character. The basic principle of markup is that if a user wants
something, they should unambiguously mark it up in their data: if they want
non-collapsible space, the correct answer is "Use &#x3000;" or "Use
xml:space='preserve'". (However, font issues are important here: IDEOGRAPHIC
SPACE may be twice as wide as " " spaces, so the xml:lang attribute may be
important.)
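
A minimal sketch of why &#x3000; survives, assuming an application that
collapses only the four characters XML itself calls white space (production
[3] S). The collapse function is hypothetical, not taken from any product:

```python
import re

# XML's own white space, production [3] S: space, tab, CR, LF.
XML_S = re.compile(r"[\x20\x09\x0D\x0A]+")

def collapse(text):
    """Hypothetical application-level collapsing of XML white space runs."""
    return XML_S.sub(" ", text)

ascii_gap = collapse("a    b")
ideographic_gap = collapse("a\u3000\u3000b")

print(repr(ascii_gap), repr(ideographic_gap))
```

The character class is spelled out on purpose: general-purpose helpers such
as Python's \s or str.split() treat U+3000 as whitespace too, so an
application that relies on them would collapse it after all.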

I urge developers to make sure that their products handle the 17 ISO 10646
spacing/hyphenation characters properly. There have been previous postings on
this group (what happened to that XML jewels website: it was there too?), or
get the Unicode book, or get ISO 10646, or (best option:-) get my book (XML
& SGML Cookbook, p 3-90).

Rick Jelliffe