An introduction to Structured Documents

Peter Murray-Rust

Virtual School of Molecular Sciences

Nottingham University, UK

About this paper

This paper is being written to accompany the publication on CDROM of ECHET96 ("Electronic Conference on Heterocyclic Chemistry"), run by Henry Rzepa, Chris Leach and others at Imperial College, London, UK. It is being sponsored by the Royal Society of Chemistry who (along with Cambridge, Leeds and IC) are participants in the CLIC project. This is one of the projects under E-Lib, a UK-based programme to promote electronic publishing. CLIC makes substantial use of SGML, and Chemical Markup Language (an SGML-based approach to molecular information management and publishing) is being developed in parallel with CLIC. The sponsors have agreed to make part of the CDROM available for CML material, of which this paper is part.

At the same time, the W3 consortium is promoting the use of SGML on the WWW, particularly through a simplified, easy-to-use version called XML. Chemical Markup Language is written using XML, and this paper is written in the belief that it may be useful to those interested in the XML program, since CML is one of the first working applications of XML.

The paper assumes that the reader knows nothing about Markup Languages (other than an acquaintance with HTML). It is primarily aimed at those who are interested in authoring or browsing documents with the next generation of markup languages, especially those created with XML. In its CDROM version it is accompanied by a structured document browser, JUMBO, which is a general XML browser written in Java and specifically extended to support CML and molecular applications. The CDROM contains a CML tutorial, many CML examples, and a number of screenshots of JUMBO displaying CML documents. For those of you reading this from a WWW page, this material can be found under: The CML home Page. CML is part of the portfolio of the Open Molecule Foundation, which is a newly constituted open body to promote interoperability in molecular sciences.

The paper alludes to various software tools, but does not cover their operation or implementation. However, with the exception of stylesheets, most of the operations described here for CML have already been implemented as a prototype using the JUMBO browser and processor. Nor is the paper a tutorial for CML as one is included in the CML distribution.

Finally I should emphasise that SGML can be used in many ways, and my approach does not necessarily do justice to the commonest use which is the management and publication of complex (mainly textual) documents. Projects in this area often involve many megabytes of data and industrial strength engines. I hope, however, that the principles described here will be generally useful.

Preamble

Two years ago I had never heard of structured documents, but I have since come to see them as one of the most effective and cheapest ways of managing information. The basic idea is simple, but when I first came across it I failed to see its importance, so this paper is written as a guide to what is now possible. In particular it explains the new simple language XML being developed by a working group (WG) of the W3 consortium. I have used this language as the basis for a markup language in technical subjects (TecML) and particularly molecular sciences (Chemical Markup Language, CML).

The paper is written as a simple structured document, using HTML, although it could have been written in CML. Since CML is being developed at the same time as XML, readers may belong to two categories:

My hope is that both can read it without problems - the science is minimal and I hope that you can make the mental jump to other disciplines. However, I shall slant it towards those who wish to carry precise, possibly non-textual, information arranged in (potentially quite complex) data structures. I shall use the term document, but this could represent a piece of information without conventional text, such as a molecule. Moreover, documents can have a very close relation to objects, and if you are comfortable with Object-Oriented languages you may like to substitute 'object' for 'document'. In practice, XML documents can be directly and automatically transformed into objects, although the reverse may not always be quite so easy.

It will help if you know something about HTML, and you can relate the source of the document to its rendered form. It will be useful if you have been involved in authoring or editing HTML documents at the source level, and you shouldn't feel frightened of tags (strings of characters enclosed in diamond brackets <...>). The markup I shall introduce you to uses essentially the same syntax as HTML, and the main thing that may be new to you will be the concepts underneath this, rather than any new technology. I am primarily writing this paper in the context of document delivery over networks, but markup is also ideally suited to the management of 'traditional' documents. It is often seen as a key tool in making them 'future-proof' and interchangeable between applications (interoperability).

Some of what I say may appear trivial, perhaps just an exhortation to include some structure or navigation aids in your text. For a human reader this may be true, but for a machine (and the people who have to write the programs) it is of immense importance. I have seen several projects (including some of my own) which have tried to produce machine-readable information and failed because the nature of the task hadn't been appreciated.

The important point about the XML approach is that it has been designed to separate different parts of the problem and to solve them independently. I'll explain these ideas in more detail below, but one example is the distinction between syntax (the basic rules for carrying the information components) and semantics (what meaning you put on them and what behaviour a machine is expected to perform). This is a much more challenging area than people realise, since human readers don't have problems with it.

Introduction

One of the great polymaths of this century, J.D.Bernal, inspired the development of information systems in molecular science. In 1962 he urged that the problems of scientific information in crystallography (his own field) and solid state physics should be treated as one in communication engineering. 30 years on we have most of the tools that are required to get the best information in the minimum quantity in the shortest time, from the people who are producing the information to the people who want it, whether they know they want it or not. (Bernal's words, quoted in Sage, Maurice Goldsmith, p219.) I believe that structured documents, especially using markup languages such as CML/XML have a key role to play. Nothing comes free, but where this approach is possible it's very cost effective.

Many scientists are unaware of the research during the last 30 years into the management of information. A recent and valuable review is: "Information Retrieval in Digital Libraries: Bringing Search to the Net", Bruce R. Schatz, Science, 275, pp. 327-334, (1997). [I shall comment on the format of the last sentence shortly.] In this Schatz shows that previous research in the analysis of complex documents, including hyperlinking, concept analysis, and vocabulary switching between disciplines, is now possible on a production scale. Much of his emphasis is on analysis of conventional documents produced by authors who have no knowledge of markup and who do not use vocabularies. For that reason, complex systems are required to extract implicit information from the documents, and they rely on having appropriate text to analyse. Automatic extraction of numerical and other non-textual information will be much more difficult.

Structure and Markup

We often take for granted the power of the human brain in extracting implicit information from documents. We have been trained over centuries to realise that documents have structure (Table Of Contents (TOC), Indexes, Chapters with included Sections, and so on). It probably seems 'obvious' to you that you are reading the fourth section (Structure and Markup) in the paper (A simple introduction to structured documents). The HTML language and rendering tools which you are using to read it provide a simple but extremely effective set of visual clues such as the Chapter being set in larger type. However the logical structure of the document is simply:

HTML
  HEAD
    TITLE
  BODY
    H1 (Chapter)
    H2 (Section)
    H3 (Subsection)
    H3 
    H2 
    P  (Paragraph)
    P
    P
    P
    P
    H2
    P
    P
    P
    H2
    P
    P
    ... and so on ...
    ADDRESS
where I have used the convention of indentation to show that one component includes another. This is a common approach in many TOCs and human readers will implicitly deduce a hierarchy from the above diagram. But a machine could not unless it had sophisticated heuristics, and it would also make mistakes.

You may now find it useful to have a window open on your browser with the source of this document visible.

The formal structure in this document is quite limited, and that is one of the reasons that HTML has been so successful. Humans can author HTML documents easily and human readers can supply the implicit structure. But if you look again at the TOC diagram you will see that Chapters do NOT include Sections in a formal manner, nor do Sections include Paragraphs. The first occurrence of H2 and H3 is used for the author and affiliation, which is not a 'Section'.

An information component (an Element) contains another if the start-tag and end-tag of the container completely enclose the contained. Thus the HEAD element contains a TITLE element, and the TITLE element contains a string of characters (technically the term is #PCDATA). There's a formal set of rules in HTML for what Elements can contain what others and where they can occur. Thus it's not formally allowed to have TITLE in the BODY of your document. These rules, which you won't need to read, are called a Document Type Definition (DTD). They are written in a language called SGML, which you won't need to learn unless you do a great deal of work in this field.

[If you have already come across SGML and been put off for some reason, please don't switch off here. XML has been carefully designed to make it much easier to understand the concepts and there are many fewer terms. For example, you don't even have to have a DTD if you don't want.]

This document has an inherent structure in the order of its Elements. Most people would reasonably assume that an H2 element 'belongs to' the preceding H1, and that P elements belong to the preceding H2. It would be quite natural to use phrases like "the second sentence of the second paragraph in the section called 'Introduction'". Humans can do this easily, although it's easy to get lost in large documents. The important news is that XML now makes it possible for machines to do the same sort of thing with simple rules and complete precision. The Text Encoding Initiative (a large international project to mark up the world's literature) has developed tools for doing this, and they will be available to the XML community.
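
As a rough illustration of such machine addressing, here is a minimal Python sketch using the standard xml.etree.ElementTree module. The sample document, the function names and the naive sentence splitter are all assumptions made for this example; it is not the TEI tooling mentioned above, and it relies on the flat H1/H2/P convention being followed consistently.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed document in the flat H1/H2/P style described above.
DOC = """
<BODY>
  <H1>An introduction to Structured Documents</H1>
  <H2>Preamble</H2>
  <P>First preamble paragraph.</P>
  <H2>Introduction</H2>
  <P>Bernal inspired information systems. He wrote in 1962.</P>
  <P>Many scientists are unaware. A recent review exists. It is by Schatz.</P>
  <H2>Structure and Markup</H2>
  <P>Documents have structure.</P>
</BODY>
"""

def paragraphs_of_section(body, title):
    """Return the P elements that follow the H2 with this title, up to the next H2."""
    paragraphs = []
    in_section = False
    for child in body:
        if child.tag == "H2":
            # Each H2 opens a new section; we are "inside" only if the title matches.
            in_section = (child.text or "").strip() == title
        elif child.tag == "P" and in_section:
            paragraphs.append(child)
    return paragraphs

def sentence(paragraph, n):
    """Return the n-th sentence (1-based), splitting naively after '.', '!' or '?'."""
    text = (paragraph.text or "").strip()
    parts = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    return parts[n - 1]

body = ET.fromstring(DOC)
second_paragraph = paragraphs_of_section(body, "Introduction")[1]
print(sentence(second_paragraph, 2))
```

The rule "a P belongs to the preceding H2" is written out explicitly in paragraphs_of_section; with nested container elements the same lookup would not need that heuristic at all.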

[NOTE on HTML: In HTML there are no formal conventions for what constitutes a Chapter or Section, and no restriction as to what elements can follow others. Therefore you can't rely on analysing an arbitrary HTML document in the way I've outlined. This highlights the need for more formal rules, agreements and guidelines. In XML we are likely to see communities such as users of CML develop their own rules, which they enforce or encourage as they feel. For example, there is no restriction on what order Elements can occur in a CML document, but there is a requirement that ATOMS can only occur within a MOL (molecule Element). (In CML I use the term ChemicalElement to avoid confusion.)]

In the Schatz reference (Introduction: Para 2, sentence 2), you will probably 'know automatically' what the components are. The thing in brackets must be the year, 'pp.' is short for 'pages', the bold type must be the volume, and the italics are the journal title. But this is not obvious to a machine, and trying to write a parser for this is difficult and error-prone. Many different publishing houses have their own conventions. The Royal Society of Chemistry might format this as: B. R. Schatz, Science, 1997, 275, 327. Any error in punctuation such as missing periods causes serious problems for a machine, and conversions between different formats will probably involve much manual crafting.

The precise components of the reference are well understood and largely agreed within the bibliographic community. They are a good example of something that can be enhanced by markup. Markup is the process of adding information to a document which is not part of the content but adds information about the structure or elements. Using the citation as an example, we can write:

<BIB>
  <TITLE>
  Information Retrieval in Digital Libraries: Bringing Search to the Net
  </TITLE>
  <JOURNAL>Science</JOURNAL>
  <AUTHOR>
    <FIRSTNAME>Bruce</FIRSTNAME>
    <INITIAL>R</INITIAL>
    <LASTNAME>Schatz</LASTNAME>
  </AUTHOR>
  <VOLUME>275</VOLUME>
  <YEAR>1997</YEAR>
  <PAGES>327-334</PAGES>
</BIB>
Even if they had never seen markup before, most scientists would implicitly understand this information. The advantage is that it's also straightforward to parse it by machine. If the tags (<...>) are ignored, then the content is exactly the same as earlier (except for punctuation and rendering). It's often useful to think of markup as invisible annotations on your document. Many modern systems do not mark up the document itself, but provide a separate document with the markup. This is a feature of hypermedia systems and one of the goals of XML is to formalise this through the development of linking syntax and semantics in Phase II, but this is outside the scope of this paper.
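
To make the point about machine parsing concrete, the following Python sketch reads the BIB example with the standard xml.etree.ElementTree module and renders it in the two citation formats mentioned earlier. The Citation class and the function names are assumptions introduced for this illustration, and a modern XML parser stands in for whatever tool a reader might actually use.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET

BIB_XML = """
<BIB>
  <TITLE>
  Information Retrieval in Digital Libraries: Bringing Search to the Net
  </TITLE>
  <JOURNAL>Science</JOURNAL>
  <AUTHOR>
    <FIRSTNAME>Bruce</FIRSTNAME>
    <INITIAL>R</INITIAL>
    <LASTNAME>Schatz</LASTNAME>
  </AUTHOR>
  <VOLUME>275</VOLUME>
  <YEAR>1997</YEAR>
  <PAGES>327-334</PAGES>
</BIB>
"""

@dataclass
class Citation:
    """Hypothetical object form of one BIB element."""
    title: str
    journal: str
    first_name: str
    initial: str
    last_name: str
    volume: str
    year: str
    pages: str

def parse_bib(xml_text):
    """Turn a BIB element into a Citation; every component is found by name."""
    bib = ET.fromstring(xml_text)

    def text(path):
        return bib.findtext(path).strip()

    return Citation(
        title=text("TITLE"),
        journal=text("JOURNAL"),
        first_name=text("AUTHOR/FIRSTNAME"),
        initial=text("AUTHOR/INITIAL"),
        last_name=text("AUTHOR/LASTNAME"),
        volume=text("VOLUME"),
        year=text("YEAR"),
        pages=text("PAGES"),
    )

def long_style(c):
    """Format used for the Schatz reference in the Introduction."""
    return (f'"{c.title}", {c.first_name} {c.initial}. {c.last_name}, '
            f"{c.journal}, {c.volume}, pp. {c.pages}, ({c.year}).")

def rsc_style(c):
    """Format attributed above to the Royal Society of Chemistry."""
    first_page = c.pages.split("-")[0]
    return (f"{c.first_name[0]}. {c.initial}. {c.last_name}, "
            f"{c.journal}, {c.year}, {c.volume}, {first_page}.")

citation = parse_bib(BIB_XML)
print(long_style(citation))
print(rsc_style(citation))
```

No punctuation has to be guessed when reading the reference; the conventions of each publishing house live only in the two small formatting functions.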

What is so remarkable about this? In essence we have made it possible for a machine to capture some of those things that a human takes for granted.

Rules, meta-languages and validity

I started writing Chemical Markup Language because I wanted to transfer molecules precisely using ATOMS, BONDS and related information. It was always clear that 'chemistry' was more than this and that we needed the tools to encapsulate numeric and other data such as spectra. I looked at a wide variety of journals in the scientific area to see what sort of information was general to all of them and whether a markup language could be devised which could manage this wide range. It required a meta-language, and this section is an explanation of what that involves.

I'll explain the 'meta-' concept using XML and then show how it extends to applications such as TecML. XML, despite its name, is not a language but a meta-language (a tool for writing languages). XML is a set of rules which enable markup languages to be written, and TecML and CML are two such languages. For example, one rule in XML is "every non-empty element must have a start-tag and an end-tag", so that the <AUTHOR> tag must be balanced by a </AUTHOR> tag. This is not a strict requirement of HTML, for example, which uses a more flexible set of rules. Another rule is "all attribute values must occur within quotes (")". Writing a markup language is analogous to writing a program, and the relation of XML to CML is much the same as C to hello.c. We say that CML 'is an application of XML', or 'is written in XML', just as 'hello.c is written in C.' XML is a little stricter than HTML in the syntax it allows, but the benefit is that it's much easier to write browsers and other applications.
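
The two rules quoted above can be tried out with any XML parser. The sketch below uses Python's standard xml.etree.ElementTree module, which is an assumption of this example (it implements the later XML 1.0 rules rather than the draft discussed here); the sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the text parses as XML without needing any DTD."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True

samples = {
    "balanced tags": "<AUTHOR><LASTNAME>Schatz</LASTNAME></AUTHOR>",
    "missing end-tag": "<AUTHOR><LASTNAME>Schatz</AUTHOR>",
    "quoted attribute": '<IMG SRC="mypicture.gif"/>',
    "unquoted attribute": "<IMG SRC=mypicture.gif/>",
}

for label, text in samples.items():
    print(f"{label}: {is_well_formed(text)}")
```

The parser needs no knowledge of what AUTHOR or IMG mean; it applies only the syntactic rules, which is why generic tools are comparatively easy to write.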

XML allows for two sorts of documents, valid and well-formed. Validity requires an explicit set of rules as a DTD, which is usually a separate file but can be included in the document itself. An example of a validity criterion in HTML is that LI (a ListItem) must occur within a UL or OL container. Well-formedness is a less strict criterion and requires simply that the document can be automatically parsed without the DTD. The bibliographic example above is well-formed, but without a DTD may not be valid. It might have been an explicit rule that the author must include an element describing the language that the article was written in, such as <LANGUAGE>EN</LANGUAGE>; in this case the document fragment would be invalid. The importance of validity will depend on the community using XML. In molecular science all *.cml documents will be expected to be valid, and this is ensured by running them through a validating parser such as the free sgmls from James Clark. If a browser or other processing application such as a search engine can assume that a certified document is valid (perhaps from a validation stamp), there is no need to write a validating parser. Being valid doesn't mean the contents are necessarily sensible, and a further processor may be needed for that.

Where, and how, you enforce validity depends on what you are trying to do. If you are providing a form for authors to submit abstracts you will enforce fairly strict rules. ("It must have one or more AUTHORs, exactly one ADDRESS for correspondence, and the AUTHOR must contain either a FIRSTNAME or INITIALS but not both"). This can be enforced in a DTD. But this would be too restricting for a general scientific document, which need not always have an AUTHOR. The two forces of precision and flexibility often conflict, but can be reconciled to a large extent by providing different ways of processing documents.
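
In practice a DTD and a validating parser would enforce such rules declaratively. Purely to show what the rules amount to, here is a hand-written Python check of the three conditions quoted above; it is a stand-in and not a DTD validator (Python's standard library does not include one), and the ABSTRACT container, the LASTNAME element and the sample content are assumptions made for this sketch.

```python
import xml.etree.ElementTree as ET

def check_abstract(xml_text):
    """Return a list of rule violations; an empty list means all rules are met."""
    root = ET.fromstring(xml_text)  # raises ParseError if not even well-formed
    problems = []

    authors = root.findall("AUTHOR")
    if len(authors) < 1:
        problems.append("need one or more AUTHOR elements")

    if len(root.findall("ADDRESS")) != 1:
        problems.append("need exactly one ADDRESS element")

    for i, author in enumerate(authors, start=1):
        has_first = author.find("FIRSTNAME") is not None
        has_initials = author.find("INITIALS") is not None
        # Exactly one of the two must be present, so equal flags are an error.
        if has_first == has_initials:
            problems.append(f"AUTHOR {i}: need FIRSTNAME or INITIALS, but not both")
    return problems

# Hypothetical submissions.
GOOD = """
<ABSTRACT>
  <AUTHOR><FIRSTNAME>Ann</FIRSTNAME><LASTNAME>Example</LASTNAME></AUTHOR>
  <AUTHOR><INITIALS>B C</INITIALS><LASTNAME>Sample</LASTNAME></AUTHOR>
  <ADDRESS>Example University</ADDRESS>
</ABSTRACT>
"""

BAD = """
<ABSTRACT>
  <AUTHOR><FIRSTNAME>Ann</FIRSTNAME><INITIALS>A</INITIALS></AUTHOR>
</ABSTRACT>
"""

print(check_abstract(GOOD))
print(check_abstract(BAD))
```

Both samples are well-formed; only the first satisfies the stricter rules, which is the distinction between well-formedness and validity.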

Processing documents

At this stage it's useful to think about how an XML document might be created and processed. At its simplest level a document can be created with any text editor (which is how the BIB example was written). It can then be processed with the human brain. This isn't a trivial point; there is no fundamental requirement for software at any or all stages of managing XML documents. In practice, however, software adds enormously to the value. CML documents such as those including atomic coordinates only make sense when rendered by computer.

A general authoring process can be represented as:


               stylesheets

Authoring       assembly                      validation
Validation ------ // ----> parsing & validation --> postprocessing
                serving 
Editing                                        rendering
Conversion     objects/Java

The break (//) signifies where the document is transferred from author/server to client/reader. Not all XML applications will fit this simple model, but it serves to highlight the components:

This has been a long section, but I hope it shows that XML is not simply a document processing language.

Attributes

So far I have only used Element names (often called GIs) to carry the markup. XML also provides attributes as another way of modulating the element. Attributes occur within start-tags, and well-known examples from HTML are HREF (in A) and SRC (in IMG):
<A HREF="http://www.venus.co.uk/omf/cml/">
<IMG SRC="mypicture.gif" WIDTH="500" HEIGHT="100">.
Attributes are semantically free in the same way as Elements, and can be used with stylesheets or Java classes to vary their meaning.

Whether Elements or attributes are used to convey markup is a matter of preference and style, but in general the more flexible the document the more I would recommend attributes. As a point of style, many people suggest that document content should not occur in attributes, but this is not universal. Here are some simple examples of the use of attributes:

In XML-link, attributes will be extensively used.

Flexibility and meta-DTDs

When developing an XML application the author has to decide whether precision and standardisation are required, or whether it is more important to be flexible. If precision is required, then the DTD will be the primary means of enforcing it and as a consequence may become large and complex. It implies that the 'standard' is unlikely to change. When new versions are produced, the complete pipeline from authoring to rendering will need to be revised. As this is a major effort and cost, careful planning of the DTD is necessary.

If flexibility is more important, either because the field is evolving or because it is very broad, a rigid DTD may restrict development. In that case a more general DTD is useful, with flexibility being added through attributes and their values. So in TecML I have created an Element type XVAR, for a scalar variable. I use attributes to tune the use and properties of XVAR and it's possible to make it do 'almost anything'! For example it can be given a TYPE such as STRING, FLOAT, DATE and a TITLE. In this way any number of objects can be precisely described. Here are three examples:

<XVAR TYPE="STRING" TITLE="Greeting">Hello world!</XVAR>
<XVAR TYPE="DATE">2000-01-01</XVAR>
<XVAR TYPE="FLOAT" DICTNAME="Melting Point" UNITS="Fahrenheit">451</XVAR>
The last is particularly important because it uses the concept of linking to add semantics. This is a big feature of XML, and the precise syntax is being developed in XML-Phase-II. CML uses DICTNAME to refer to an entry in a specified glossary which defines what "Melting Point" is. This entry could have further links to other resources such as world collections of physical data. Similarly I use UNITS to specify precisely what scale of temperature is used. Again this is provided by a glossary in which SI units are the default. By using this approach it is possible to describe any scalar variable simply by varying the attributes and their values. Note that the attribute types must be defined in the DTD, but their values may either be unlimited or can be restricted to a set of possible values.
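
One way a processor might act on these attributes is sketched below in Python. This is not how JUMBO or any CML tool is implemented: the glossary text, the units table, the converter table and the read_xvar function are all hypothetical, and a real glossary would be a separate, linked resource rather than an in-memory dictionary.

```python
from datetime import date
import xml.etree.ElementTree as ET

# Hypothetical glossary entry for the DICTNAME used in the example above.
GLOSSARY = {"Melting Point": "temperature at which a solid turns into a liquid"}

# Hypothetical units table: converts a temperature to the SI unit (kelvin).
TO_KELVIN = {
    "Kelvin": lambda v: v,
    "Celsius": lambda v: v + 273.15,
    "Fahrenheit": lambda v: (v - 32.0) * 5.0 / 9.0 + 273.15,
}

# One converter per TYPE value mentioned in the text.
CONVERTERS = {"STRING": str, "FLOAT": float, "DATE": date.fromisoformat}

def read_xvar(xml_text):
    """Parse one XVAR element into a dict with a typed value and resolved attributes."""
    element = ET.fromstring(xml_text)
    if element.tag != "XVAR":
        raise ValueError(f"expected XVAR, got {element.tag}")
    # TYPE selects how the character content is interpreted.
    value = CONVERTERS[element.attrib["TYPE"]]((element.text or "").strip())
    result = {"value": value, "title": element.get("TITLE")}
    dictname = element.get("DICTNAME")
    if dictname is not None:
        result["definition"] = GLOSSARY[dictname]
    units = element.get("UNITS")
    if units is not None:
        result["kelvin"] = TO_KELVIN[units](value)
    return result

greeting = read_xvar('<XVAR TYPE="STRING" TITLE="Greeting">Hello world!</XVAR>')
new_year = read_xvar('<XVAR TYPE="DATE">2000-01-01</XVAR>')
melting = read_xvar(
    '<XVAR TYPE="FLOAT" DICTNAME="Melting Point" UNITS="Fahrenheit">451</XVAR>'
)
print(greeting["value"], new_year["value"], round(melting["kelvin"], 2))
```

The element type stays the same in all three cases; only the attribute values change what the processor does, which is the flexibility argued for above.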

The TecML DTD uses very few Element types, and these have been carefully chosen to cover most of the general concepts which arise in technical subjects. They include ARRAY, XLIST (a general tool for data structures such as tables and trees), FIGURE (a diagram), PERSON, BIB, and XNOTATION. (NOTATION is an XML concept which allows non-XML data to be carried in a document, and is therefore a way of including 'foreign' file types.) With these simple tools and a wide range of attributes it is possible to mark up most technical scientific publications. Areas which are not covered are: parsable mathematics, fine-grained markup in diagrams, and anything that involves complex relationships. Of course there has to be general agreement about the semantics of the markup, but this is a great advance compared with having no markup at all. In some cases where adequate methods have been developed for well defined components, those can be encapsulated and need not be translated. Examples are NETCDF for multidimensional data and VRML for 3-D graphics.

Searching

It was a revelation when I realised how powerful structured documents (SDs) are for carrying information. I think that data in many disciplines map far more naturally into a tree structure than into a relational database (RDB). An SD has a concept of sequential information while an RDB does not. The exciting thing is that the new Object databases (including the hybrid Object-Relational Databases (ORDBs)) have the exact architecture which is needed to hold XML-like documents, and suppliers now offer SGML interfaces. (For any particular application, of course, there may be a choice between RDBs and ORDBs.) The attraction of Objects over RDBs is that it is much easier to design the data architecture. In many cases simply creating well marked-up documents may be all that is required for their use in the databases of the future.

The reason for this confident statement is that SDs provide a very rich context for individual Elements. Thus we can ask questions like:

Despite their apparent complexity, all these can be managed with standard techniques for searching structured documents. Because of this power, a special language (Structured Document Query Language - SDQL) has been developed and will interoperate with XML. If simple application-specific tools are developed then queries like the following are possible:
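
As one illustration of a query that depends on context, the Python sketch below uses the limited path syntax of the standard xml.etree.ElementTree module. It is not SDQL. The PAPER and SECTION elements, the placement of XVAR inside MOL, the sample names and all the values are assumptions invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; names and numbers are illustrative only.
DOC = """
<PAPER>
  <SECTION TITLE="Experimental">
    <MOL TITLE="sample-1">
      <XVAR TYPE="FLOAT" DICTNAME="Melting Point" UNITS="Celsius">120.5</XVAR>
    </MOL>
    <MOL TITLE="sample-2">
      <XVAR TYPE="FLOAT" DICTNAME="Melting Point" UNITS="Celsius">42.0</XVAR>
      <XVAR TYPE="FLOAT" DICTNAME="Boiling Point" UNITS="Celsius">180.0</XVAR>
    </MOL>
  </SECTION>
  <SECTION TITLE="Discussion">
    <MOL TITLE="sample-3">
      <XVAR TYPE="FLOAT" DICTNAME="Melting Point" UNITS="Celsius">300.0</XVAR>
    </MOL>
  </SECTION>
</PAPER>
"""

def melting_points(root, section_title):
    """Map molecule title to melting point for MOLs inside the named SECTION."""
    result = {}
    for mol in root.findall(f".//SECTION[@TITLE='{section_title}']/MOL"):
        xvar = mol.find("XVAR[@DICTNAME='Melting Point']")
        if xvar is not None:
            result[mol.get("TITLE")] = float(xvar.text)
    return result

root = ET.fromstring(DOC)
experimental = melting_points(root, "Experimental")
low_melting = [name for name, mp in experimental.items() if mp < 100.0]
print(experimental)
print(low_melting)
```

The query combines three pieces of context (the enclosing section, the enclosing molecule and the glossary name of the variable), none of which would be recoverable from unmarked text without heuristics.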

Summary, and the next phase

This document has described only part of what XML can offer to a scientific or publishing community. XML has three phases, and only the first has been covered here. Phase II is to define a hyperlinking system; and Phase III to define how style sheets will be used. Hyperlinking can range from the simple, unverified link (as in HTML's HREF attribute for Anchors) to a complete database of typed and validated links over thousands of documents. Phase II is addressing all of these and has the power to support complex systems.

Technical aspects and the future

How will XML develop in practice? A natural impetus will come from those people who already use SGML and see how it could be used over the WWW. It is certainly something that publishers should look at very closely as it has all the components required, including the likelihood that solutions will interoperate with Java.

XML is the ideal language for the creation and transmission of database entries. The use of entities means it can manage distributed components, it maps well onto objects and it can manage complex relationships through its linking scheme. Most of the software components are already written.

How would it be used with a browser? Assuming that the bulk of tools are written in Java, we can foresee helper applications or plugins, and perhaps there will be more autonomous tools which are capable of independent action. It's an excellent approach to managing legacy documents rather than writing a specific helper for each type.

I hope that there will be enough tools that XML will provide the same creative and expressive opportunities that HTML has done. However, it's important to realise that freely available software is required and any tools for structured document management, especially in Java, will be extremely welcome.

References

The SGML and XML community has excellent WWW resources and so it is unnecessary to give a large list of pointers. Some key sites are:

© Peter Murray-Rust, 1996, 1997