Re: Binary Data in XML (a first pass?)

Chris Smith (smith@interlog.com)
Thu, 1 Oct 1998 00:34:26 -0400 (EDT)


On Wed, 30 Sep 1998, Tim Bray wrote:

> From: Tim Bray <tbray@textuality.com>
> Subject: Re: Binary Data in XML
>
> Suppose I wrote up a NOTE, should occupy less than one page, proposing
> a reserved attribute xml:packed with, for the moment, only two
> allowed values, "none" and "base64". The default value is "none".
> If an element has xml:packed="base64" this means that
>
> (a) the content of the element to which this is attached must be
> pure #PCDATA, no child elements and no references, and
> (b) the content is encoded in base64, leading and trailing spaces allowed

If I may be so bold -- this was addressed in the development of Open
Trading Protocol, and we simmered it down to a concise form. Don Park
and Gavin McKenzie helped sift out a standalone form, which then went
*back* into the OTP group. The result is summarized below, minus most
of the document surrounds.

Use of this internally in OTP has allowed for encapsulation and
a framework structure without a lot of development overhead.

----------------------------------------------------------------

XML Packaging

Basic Goals

It is suggested that you read the entire document, since
there are some forward references in the Goals section that
may only make sense after reading through the whole thing.

1)  Inclusion of a variety of items

This variety of items can potentially be defined
dynamically by the groups/parties/systems involved. Some
systems will be "static" implementations - not driven
directly by the DTD, but using a parser and embedding the
understanding of the DTD in the system itself. It is for
this reason that parameter entities are not used. Some
people will only develop their system from the DTD, not run
their system using it.

2)  Simplest possible inclusion of plain text items

This means so simple that it should look like PCDATA. More
to the point, XML can already handle the plain text case,
so we should not have to step out to something else (MIME
or otherwise) to handle plain text.

3)  Easy inclusion of graphic or other binary entities.

This is for the cases where most groups would agree what is
desired (e.g. a GIF or JPEG), but XML does not allow for
direct embedding. This is the target for the MIME:mimetype
allowance. Data can be directly converted using standard
BASE64 routines, and no generation or checking of headers
needs to be done.

4)  Leverage MIME power!

This is the origin of the generalized MIME allowance. In
particular, MIME:mimetype simply can't work with multipart
types.

5)  Allow for private customization.

This is a somewhat contentious inclusion. It can be argued
that private customization can already be achieved
using the MIME application/x-private notation, so why
duplicate that capability?

However, there is a growing body of XML 'private
customization', and it would be preferable not to have to
go through MIME in order to get to it. The XML content
provides a straightforward indication that the content,
likely straight PCDATA (not transformed), is an embedded
XML document, perhaps XML/EDI. I also think this is wise,
since those doing their work in XML may not provide a
standardized private MIME label for their work.

For exactly the same reasons, the general MIME availability
should be kept as well. If a group has a reasonably
standardized MIME label for a private custom format, then
we need the full MIME capability to support it.

Finally, the x-ddd:usercode version has already proven
useful where different parts of a system may communicate
using this mechanism, because it's easier than trying to
communicate through a (non-existent!) private channel.

6)  To be used in place of ANY

ANY content is understandably difficult to parse. There may
or may not be guidelines to help you. It is preferable to
match extensibility with a little more structure.

DTD For Package

This then leads to a very compact DTD item (more
definitions below).

<!ELEMENT Package (#PCDATA)>
<!ATTLIST Package
content CDATA "PCDATA"
transform (NONE|BASE64) "NONE"
>
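
As an illustration of how little is needed to produce a Package, here
is a minimal Python sketch. The helper name make_package and the
sample payloads are assumptions made for this example, not part of
the OTP work, and the NONE branch assumes the data is UTF-8 text.

```python
import base64
import xml.etree.ElementTree as ET

def make_package(data: bytes, content: str = "PCDATA",
                 transform: str = "NONE") -> ET.Element:
    """Build a Package element (hypothetical helper for illustration)."""
    if transform not in ("NONE", "BASE64"):
        raise ValueError("transform must be NONE or BASE64")
    pkg = ET.Element("Package", {"content": content, "transform": transform})
    if transform == "BASE64":
        # Binary data goes through a standard BASE64 routine, no headers.
        pkg.text = base64.b64encode(data).decode("ascii")
    else:
        # Assumption: the bytes are UTF-8 text. ElementTree escapes
        # & and < when serializing, so the result is legal PCDATA.
        pkg.text = data.decode("utf-8")
    return pkg

plain = ET.tostring(make_package(b"Hello & welcome"), encoding="unicode")
gif = ET.tostring(make_package(b"GIF89a", "MIME:image/gif", "BASE64"),
                  encoding="unicode")
print(plain)
print(gif)
```

Both attributes are written out explicitly here, so a receiver that
never reads the DTD does not have to know the defaults.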

Note that any special details, especially custom
attributes, must be represented at a higher level. For
example:

<!ELEMENT SpecializedData (Package)>
<!ATTLIST SpecializedData
ID ID #REQUIRED
CustomerId CDATA #IMPLIED
PaymentId CDATA #IMPLIED
SoftwareId CDATA #IMPLIED
>
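
A "static" receiver of such a wrapper might look roughly like the
Python sketch below. It does not read the DTD, so it applies the
ATTLIST defaults by hand. The function name read_specialized and the
sample attribute values are invented for this example.

```python
import xml.etree.ElementTree as ET

DOC = """<SpecializedData ID="sd1" CustomerId="C-42">
  <Package>plain text payload</Package>
</SpecializedData>"""

def read_specialized(xml_text):
    """Pull the wrapper attributes and the Package payload apart."""
    root = ET.fromstring(xml_text)
    pkg = root.find("Package")
    return {
        "id": root.get("ID"),
        "customer": root.get("CustomerId"),
        "payment": root.get("PaymentId"),  # #IMPLIED, so this may be None
        # No DTD is consulted, so the defaults are supplied in code.
        "content": pkg.get("content", "PCDATA"),
        "transform": pkg.get("transform", "NONE"),
        "payload": pkg.text,
    }

info = read_specialized(DOC)
print(info)
```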

Detailed interpretations of the attributes follow:

Attribute: content

The content attribute defaults to the value "PCDATA", to
imply that the content consists only of legal PCDATA
characters for XML. When used in this manner, the content
of the Package element effectively substitutes for a simple
#PCDATA content in the parent element.

Attribute value for "content": PCDATA

The content of Package can be treated as PCDATA with no
further processing.

Attribute value for "content": MIME

The content of Package is a complete MIME item. Processing
should include looking for MIME headers inside the Package
content.
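
As a rough sketch of "looking for MIME headers inside the Package
content", the standard Python email package can be pointed at the
text of the element. The function name parse_mime_package and the
sample message are assumptions for this example only.

```python
from email import message_from_string

def parse_mime_package(text):
    """Split content="MIME" Package text into (content type, decoded bytes)."""
    # Whitespace before the first header would be mistaken for a body,
    # so it is stripped before parsing.
    msg = message_from_string(text.lstrip())
    # For a multipart item get_payload(decode=True) returns None; the
    # caller would then have to walk the individual parts instead.
    return msg.get_content_type(), msg.get_payload(decode=True)

sample = (
    "Content-Type: image/gif\n"
    "Content-Transfer-Encoding: base64\n"
    "\n"
    "R0lGODlh\n"
)
ctype, body = parse_mime_package(sample)
print(ctype, body)
```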

Attribute value for "content": MIME:mimetype

The content of Package is MIME content, with the following
headers implied:

Content-Type: mimetype

Although it is possible to have MIME:mimetype with
transform="NONE", it is far more likely to have
transform="BASE64". Note that if transform="NONE" is used,
then the entire content must still conform to PCDATA. Some
characters will need to be encoded either as the XML
predefined entities, or as numeric character references.

Attribute value for "content": XML

The content of Package can be treated as an XML document.
This document may include an XML declaration, and it may
refer to a different DTD than that of the enclosing
document.

Character entities and CDATA sections, or
transform="BASE64", must be used to ensure that the Package
contents are legitimate PCDATA. Enclosing a raw XML
document will cause parsing errors while attempting to
parse the enclosing document.
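
The following Python sketch shows the difference between the two
cases. The inner Order document is made up for the example, and
escaping with xml.sax.saxutils.escape is only one of the options
named above (a CDATA section or BASE64 would do as well).

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# A made-up inner document, complete with its own XML declaration.
inner = ('<?xml version="1.0"?>'
         '<Order><Item qty="2">Widget &amp; Co</Item></Order>')

# Enclosing it raw breaks the parse of the outer document.
try:
    ET.fromstring('<Package content="XML">' + inner + '</Package>')
    raw_ok = True
except ET.ParseError:
    raw_ok = False

# Escaping &, < and > turns the inner document into plain PCDATA.
outer = '<Package content="XML">' + escape(inner) + '</Package>'
pkg = ET.fromstring(outer)

# The outer parser has undone the escaping, so pkg.text is the
# original inner document and can be handed to a second parse.
inner_root = ET.fromstring(pkg.text)
item = inner_root.find("Item")
print(raw_ok, item.get("qty"), item.text)
```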

The well-formedness or validity of the document inside the
Package has no effect on the parsing of the enclosing
document. Obviously, a non-well-formed or invalid inclusion
may still cause errors within an application. However, there
are legitimate reasons, such as user support, to enclose XML
documents that are not well-formed.

Attribute value for "content": x-ddd:usercode

The content is private, where ddd represents a domain name
of a user, and usercode represents a particular content
format defined by that user.

The guidelines around an x-ddd are very loose. Given company
FFGGHH Inc, all of x-www.ffgghh.com, x-ffgghh.com and
x-ffgghh are legitimate examples. However, only one should
be the correct format, as defined by FFGGHH Inc.

The usercode mechanism is intended to reduce the
possibility of content attribute collisions, not to provide
a mechanism that can eliminate them entirely.
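
To summarize the five forms of the content attribute, a receiver
could sort the values along the lines of this Python sketch. The
function name classify_content, the kind labels it returns, and the
"invoice" usercode are all invented for the example.

```python
def classify_content(value):
    """Return (kind, detail) for a content attribute value."""
    if value in ("PCDATA", "MIME", "XML"):
        return (value, None)
    if value.startswith("MIME:") and len(value) > len("MIME:"):
        # MIME:mimetype implies a single Content-Type header.
        return ("MIME-TYPED", value[len("MIME:"):])
    if value.startswith("x-") and ":" in value:
        # x-ddd:usercode, where ddd is whatever form the owner chose.
        domain, usercode = value[2:].split(":", 1)
        if domain and usercode:
            return ("PRIVATE", (domain, usercode))
    raise ValueError("unrecognized content value: " + repr(value))

for v in ("PCDATA", "MIME", "MIME:image/gif", "XML", "x-ffgghh.com:invoice"):
    print(v, "->", classify_content(v))
```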

Attribute: transform

Attribute value for "transform": NONE

The PCDATA content of Package is the correct representation
of the data. Note that entity expansion must occur first
(i.e. replacement of &amp; and &#9;) before the data is
examined.

CDATA sections may legitimately occur in a Package marked
transform="NONE".

Attribute value for "transform": BASE64

The PCDATA content of Package represents a BASE64 encoding
of the actual content. Although entity expansion must occur
before decoding of the Base 64 stream, it is not expected
that this will happen under normal circumstances.
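
Putting the two transform values together, recovering the actual
data might look like the Python sketch below. The name package_bytes
is hypothetical, and returning the NONE case as UTF-8 bytes is an
assumption made so that both branches yield the same type.

```python
import base64
import xml.etree.ElementTree as ET

def package_bytes(pkg):
    """Return the actual data carried by a parsed Package element."""
    transform = pkg.get("transform", "NONE")  # DTD default applied by hand
    # The XML parser has already expanded entities and CDATA sections.
    text = pkg.text or ""
    if transform == "BASE64":
        # b64decode skips whitespace by default, so line breaks and
        # indentation around the encoded stream are harmless.
        return base64.b64decode(text)
    if transform == "NONE":
        return text.encode("utf-8")  # assumption: hand back UTF-8 bytes
    raise ValueError("unknown transform: " + repr(transform))

packed = ET.fromstring(
    '<Package content="MIME:image/gif" transform="BASE64">\n'
    '  R0lGODlh\n'
    '</Package>')
plain = ET.fromstring('<Package>a &amp; b&#9;c <![CDATA[<d>]]></Package>')
print(package_bytes(packed))
print(package_bytes(plain))
```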

...Chris Smith
...Don Park
...Gavin McKenzie

---------------------------------------------------------------------------
Chris Smith <smith@interlog.com>