Re: there's empty, and then there's REALLY empty

Henry S. Thompson (ht@cogsci.ed.ac.uk)
30 Oct 1998 10:15:12 +0000


I offered:

<!DOCTYPE foo [
<!ELEMENT foo (a+)>
<!ENTITY empty ''>
<!ENTITY space ' '>
<!ELEMENT a EMPTY>]>
<foo>
&empty;
<a/>
&space;
<a/>
<![CDATA[]]>
<a/>
<![CDATA[ ]]>
<a/>
</foo>

David Megginson <david@megginson.com> wrote:

[Don't use nsgmls as a reference validator for XML]

John Cowan writes:

[both CDATA sections should be rejected]

OK, so the cat is out of the bag: nsgmls does indeed generate ONE
error, for the <![CDATA[ ]]>.

I'm not under any illusions about nsgmls as a reference standard. It's
really that I can't make sense of an interpretation of 3.2.1 which
ALLOWS the two entity references but disallows the two CDATA sections,
which is what John and (if I read David's message correctly) David
expect, and what at least the online validating parser from Richard
Goerwitz/the Brown Scholarly Technology Group does.

[Note that expat is NOT a validating parser, and swallows the file
without comment]
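
A minimal sketch of that non-validating behaviour, assuming Python's
expat-based xml.etree.ElementTree as a stand-in for the expat build
referred to above (the names DOC, child_tags and cooked_text are
illustrative only). It checks well-formedness and nothing more, and it
exposes what the application sees once entity references and CDATA
sections have been interpreted:

```python
import xml.etree.ElementTree as ET

# The example document from this message, unchanged.
DOC = """<!DOCTYPE foo [
<!ELEMENT foo (a+)>
<!ENTITY empty ''>
<!ENTITY space ' '>
<!ELEMENT a EMPTY>]>
<foo>
&empty;
<a/>
&space;
<a/>
<![CDATA[]]>
<a/>
<![CDATA[ ]]>
<a/>
</foo>"""

# A non-validating parse: the content model (a+) is never enforced,
# but internal entities and CDATA sections are interpreted.
root = ET.fromstring(DOC)

# Element children as delivered to the application.
child_tags = [child.tag for child in root]

# All character data inside <foo> after interpretation.
cooked_text = "".join(root.itertext())

print(child_tags)
print(repr(cooked_text))
print("only whitespace between children:", cooked_text.strip() == "")
```

If the parse goes through, the application cannot tell which of the
whitespace came from an entity reference and which from a CDATA
section; that distinction exists only in the raw document.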

The problem here is a fundamental one with respect to the very nature
of validation and well-formedness, which I know the editors were aware
of, but could not fully resolve at the time of publication. I think
of it as the 'raw' versus 'cooked' problem. Most of the
well-formedness constraints are on the raw document, that is, the
input character sequence as such, pre-interpretation of e.g. entity
references, marked or CDATA sections. Most of the validity
constraints, in particular the content-model enforcing ones, are on
the cooked document, that is, the effective character sequence AFTER
interpretation of . . . well, that's the problem, isn't it. In order
to make sense of a claim that the two entity references in my example
are valid, but the two CDATA sections are not, we are left in the
difficult position of saying that the validity constraint in question
applies AFTER entity expansion but BEFORE CDATA section
interpretation, which is really weird, because wrt e.g. mixed content
models, the constraint clearly applies AFTER CDATA section
interpretation.
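
The two readings can be made concrete with a toy, string-level model.
This is only a sketch under stated assumptions, not a parser: the
helper names (expand_entities, unwrap_cdata, matches_a_plus) are
hypothetical, the regular expressions handle just the constructs in
the example, and the content model (a+) is approximated as one or more
<a/> tags separated by optional white space.

```python
import re

# Raw content of <foo> from the example, before any interpretation.
RAW = (
    "\n&empty;\n<a/>\n&space;\n<a/>\n"
    "<![CDATA[]]>\n<a/>\n<![CDATA[ ]]>\n<a/>\n"
)

# Internal entities declared in the example DTD.
ENTITIES = {"empty": "", "space": " "}


def expand_entities(text):
    """Replace each &name; with its declared replacement text."""
    return re.sub(r"&(\w+);", lambda m: ENTITIES[m.group(1)], text)


def unwrap_cdata(text):
    """Replace each CDATA section with the characters it contains."""
    return re.sub(r"<!\[CDATA\[(.*?)\]\]>", r"\1", text, flags=re.S)


# Toy stand-in for (a+): <a/> one or more times, with optional
# white space (space, tab, CR, LF) before, between and after.
A_PLUS = re.compile(r"[ \t\r\n]*(?:<a/>[ \t\r\n]*)+")


def matches_a_plus(text):
    return A_PLUS.fullmatch(text) is not None


# Reading 1: the constraint applies after entity expansion but
# before CDATA interpretation.
after_entities = expand_entities(RAW)
print("entities only:", matches_a_plus(after_entities))

# Reading 2: the constraint applies to the fully cooked sequence.
fully_cooked = unwrap_cdata(after_entities)
print("fully cooked: ", matches_a_plus(fully_cooked))
```

Under the first reading the check fails solely because of the CDATA
markers left in the text; under the second the same document passes.
The outcome depends entirely on where the cooking stops.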

So my conclusion is that in fact consistency requires that CDATA
sections containing nothing but whitespace SHOULD be valid within the
content of element types declared with element-only content. In any case, I
think this issue needs to be clarified in any corrigendum which may be
forthcoming.

ht

Resource note:

I tried all the online validators listed at
http://www.oasis-open.org/cover/check-xml.html:

* The STG one worked as discussed above;

* the Koala one showed me blank pages no matter how I
invoked it;

* the xml.t2000.co.kr one offers a bewildering array of (to me
confusing) choices, including validation with or without a DTD (?), but
in any case rejected both the entity references and both the CDATA
sections;

* the WebTech validator appears to be using SP, but set up
incorrectly for XML.

So from four 'validation services', four different answers. I rest my
case.

ht

-- 
  Henry S. Thompson, HCRC Language Technology Group, University of Edinburgh
     2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND -- (44) 131 650-4440
	    Fax: (44) 131 650-4587, e-mail: ht@cogsci.ed.ac.uk
		     URL: http://www.ltg.ed.ac.uk/~ht/