http://www.ch.ic.ac.uk/rzepa/chimia/index.html#introduction
The first component of this URL, up to the #, defines a resource, i.e. a mechanism for obtaining a document from a designated global location, although it could also be a processing resource such as a search query rather than a document. The string after the # specifies a document fragment or, more generally, a definition of actions to be performed on the document or by the processing resource. The old-style session in this model is thus replaced with a collection of hyperlinked documents and resources that together define a structured set of resources.
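As a minimal sketch of this division, the Python standard library can separate the resource from the fragment. The code only illustrates the two-part structure of the URL quoted above; it is not part of the original proposal.

```python
from urllib.parse import urldefrag

url = "http://www.ch.ic.ac.uk/rzepa/chimia/index.html#introduction"

# urldefrag separates the resource locator from the fragment after the '#'.
resource, fragment = urldefrag(url)

print(resource)  # the part that locates the document or processing resource
print(fragment)  # the part that names a fragment within that resource
```

Only the first part is sent to the server; the fragment is interpreted by whatever software handles the returned resource.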
1. text/html
2. image/gif
3. chemical/x-mdl-molfile
4. model/vrml
The first of these is simply a text-based document, with the expectation that the internal structure of that document will be "marked-up" using the HTML guidelines. Based simply on the globally accepted definition of version 2 of HTML, in combination with equally standard MIME declarations and the URL definition shown above, it had become possible by mid 1994 for a single person, with only limited resources at their disposal, to create a text-searchable index of most of the Internet. Such search facilities are now commonplace, although their utility for retrieving chemical information has been questioned.3 I will return to this theme later in the article.
Finally, we will introduce chemical specifics into this discussion! The very earliest chemical content referenced within HTML type documents was in fact via hyperlinks to images, most popularly defined by a format known as GIF (graphical interchange format) introduced around 1987. Chemical diagrams encoded in GIF (and a few other common formats) were easy to create, and from 1993 to the present, their use proliferated. I argue here that their use as carriers of inter-operable chemical information has been an unmitigated disaster, and we will suffer the problems caused by their continued use for the foreseeable future. Why? Put simply, there is no standard way of identifying that their content is chemical. Only now are computer scientists starting to come up with efficient schemes for scanning the patterns in an image, creating databases of this content, and allowing search and retrieval schemes based on these patterns. The application of image recognition in chemistry remains non-trivial.4 Perhaps a better approach has been to insert chemical information into "hidden fields", which formats such as GIF and PNG (Portable Network Graphics) allow,5 and which would permit a reconstitution of the chemical information by suitable programs. Nevertheless, this approach is rarely adopted, and hence attempts to index the chemical content of the Internet have accordingly encountered enormous difficulties.3 An opportunity to insert chemical information in a simpler "human readable" form was, however, missed. The standard way of invoking an image from a document written in HTML 3.2 or earlier is as follows:
<IMG SRC="sildenafil.gif" alt="C22H30N6O4S" WIDTH="500" HEIGHT="490">
Firstly, note that the actual name of the GIF image suggests chemical content (to a chemist!), but such names are rarely chosen to be unambiguous or complete. The dimensions do not help at all. The so-called ALT field, however, provides an alternative text-based descriptor of the possible content of the image. In a chemical sense, no guidelines exist for how this field might be used. If present at all, it is normally entered as free-form text; formulae, or even better some form of atom connectivity such as a characteristic SMILES atom-connection string (e.g. O=S(C1=C([H])C([H])=C(OCC)C(C(N2[H])=NC(C(CCC)=NN3C)=C3C2=O)=C1[H])(N4CCN(C)CC4)=O), have rarely been used. The image ALT field does, however, introduce one important new concept known as meta-data. This is simply a description of an information resource, with the prefix "meta" deriving from Greek, where it can denote change (as well as "after" or "beyond"). Its purpose is to document the origins of, and/or track the change or use of, data.
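To make the indexing argument concrete, the following minimal sketch shows how an indexing program might harvest ALT text from IMG elements. The class name AltTextCollector is hypothetical, and the approach assumes that authors actually place a formula or SMILES string in the ALT field, which, as noted above, they rarely do.

```python
from html.parser import HTMLParser


class AltTextCollector(HTMLParser):
    """Collect (src, alt) pairs from IMG elements; a hypothetical indexing helper."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # html.parser lower-cases tag and attribute names before calling us.
        if tag == "img":
            attributes = dict(attrs)
            self.images.append((attributes.get("src"), attributes.get("alt")))


page = '<IMG SRC="sildenafil.gif" alt="C22H30N6O4S" WIDTH="500" HEIGHT="490">'
collector = AltTextCollector()
collector.feed(page)
collector.close()
print(collector.images)
```

The collected pairs are only as useful as the text authors supply; without agreed guidelines, the indexer cannot tell a molecular formula from a free-form caption.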
The HTML 4.0 specification6 replaces the image invocation with the following syntax:
<OBJECT title="Sildenafil (Viagra), Molecule-of-the Month at Imperial College" data="sildenafil.gif" type="image/gif" width="500" height="490"> 5-[2-ethoxy-5-(4-methylpiperazin-1-ylsulfonyl)phenyl]-1-methyl-3-n-propyl-1,6-dihydro-7H-pyrazolo[4,3-d]pyrimidin-7-one </OBJECT>
Here, the title is a possible carrier of meta-information, whilst type refers to the media (MIME) type, in some ways another form of meta-data. The remaining text (in this case the chemical name) would only be displayed if for some reason the GIF image could not be shown. This field could serve a dual purpose in providing valuable text-based information as an alternative to attempting to recognise the image content.
Recognising that chemical content on the Internet had to be rendered indexable, and hence retrievable, we proposed2 in early 1994 that, where possible, generic images with chemical content should be replaced by a more explicit declaration. The so-called chemical MIME type was introduced, along with around 20 sub-types that represented a spectrum of chemical content carried by standard or de facto file types that the community had adopted over the preceding 25 years, and for which tools for their generation and viewing were available to a greater or lesser extent. The molfile format documented comprehensively by MDL is a good example of this MIME type. It can contain either 2D or 3D molecule coordinates, and has an explicit declaration of the atom connectivity and bond types. Put simply, a reference to such a file from within an HTML document carries a strong assumption that an unambiguous declaration of a molecule might be expected. Other chemical MIME types defined other aspects of molecular connectivity, or specified analytical data carried in more or less standard formats which could be generated directly from analytical instruments. The support of this and about 10 other chemical MIME types via browser plug-in software such as Chime from MDL7 or the ChemDraw/Chem3D Net plug-in from CambridgeSoft Corporation,8 and via Java-based applets such as ChemSymphony9 and analytical data interpreters,10 has ensured that the adoption of chemical MIME has gradually increased from 1994 onwards. A typical example of how this infrastructure could be used to deliver accurate and context-rich molecular information across a range of molecular sciences is shown in Model/Figure 1. This designation is used deliberately; if you are viewing this article in print, then you will inspect a figure (or illustration) of this concept. If you are viewing the article using the Internet, then you can inspect a model.
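The association between a file and its chemical MIME type can be sketched with Python's standard mimetypes module. The mapping of the ".mol" extension to chemical/x-mdl-molfile is registered explicitly here as an assumption, since a given system may or may not already know it, and the file name viagra.mol is purely illustrative.

```python
import mimetypes

# Register the mapping locally; the ".mol" extension is an assumption here
# and may or may not already be known to the host system.
mimetypes.add_type("chemical/x-mdl-molfile", ".mol")

# guess_type returns a (media type, content encoding) pair for a file name.
media_type, encoding = mimetypes.guess_type("viagra.mol")
print(media_type)
```

A server configured in this way declares the chemical nature of the file to any client that requests it, which is precisely the information a generic image type cannot convey.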
The model, as opposed to the figure, is invoked within HTML 3.2 as follows:
<embed src=viagra.mol width=300 height=200 name=viagra BGCOLOR="white"
spinx=10 spinz=10 spiny=10 startspin=true options3d=specular display3D=ball&stick
alt="O=S(C1=C([H])C([H])=C(OCC)C(C(N2[H])=NC(C(CCC)=NN3C)=C3C2=O)=C1[H])(N4CCN(C)CC4)=O">
The attributes on the first line are generic, whilst the remaining ones are specific to the chemical model and, to a large extent, to the software used to render this model on the computer screen. The ALT field in this instance is entirely non-standard, but benign in the sense that it is ignored by the modelling software, and would serve only to provide meta-information.
The recommended future method of invoking a model within HTML 4.0 is:
<object data="viagra.mol" width="300" height="200" id="viagra" title="Viagra"
style="spinx: 10; display3D: ball&stick" type="chemical/x-mdl-molfile">
<OBJECT title="Sildenafil (Viagra), C22H30N6O4S" data="sildenafil.gif" type="image/gif" width="500" height="490">
5-[2-ethoxy-5-(4-methylpiperazin-1-ylsulfonyl)phenyl]-1-methyl-3-n-propyl-1,6-dihydro-7H-pyrazolo[4,3-d]pyrimidin-7-one
</OBJECT>
</object>
This defines a cascading order in which attempts will be made by the browser to display either the chemical/x-mdl-molfile, the image/gif or the text field objects. Within the model shown above are also objects which serve to identify small molecular components of the molecule object, such as the region of tautomerism, or a key hydrogen bond, i.e. the objects themselves can have relationships to each other.
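The cascade can be sketched as a simple selection over an ordered list of alternatives. This is only a model of the fallback behaviour described above, not of any real browser; the function name first_displayable and the use of "text/plain" to label the final fallback are assumptions made for illustration.

```python
def first_displayable(alternatives, supported_types, fallback_text):
    """Return the first alternative whose media type the viewer supports.

    alternatives is ordered from the outermost to the innermost OBJECT element.
    """
    for media_type, data in alternatives:
        if media_type in supported_types:
            return media_type, data
    # Nothing could be rendered, so fall back to the enclosed text.
    return "text/plain", fallback_text


alternatives = [
    ("chemical/x-mdl-molfile", "viagra.mol"),
    ("image/gif", "sildenafil.gif"),
]
name = (
    "5-[2-ethoxy-5-(4-methylpiperazin-1-ylsulfonyl)phenyl]-1-methyl-"
    "3-n-propyl-1,6-dihydro-7H-pyrazolo[4,3-d]pyrimidin-7-one"
)

# A viewer with a molfile plug-in shows the model; one without shows the image;
# one that can show neither is left with the chemical name as text.
print(first_displayable(alternatives, {"chemical/x-mdl-molfile", "image/gif"}, name))
print(first_displayable(alternatives, {"image/gif"}, name))
print(first_displayable(alternatives, set(), name)[0])
```

Whichever branch is taken, the reader receives some chemically meaningful representation, and an indexer that understands none of the media types can still read the name.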
Even by 1998, only perhaps a few percent of all molecular content on chemical web pages was identified using MIME types. These chemical MIME types are also largely seen now as a legacy from the days of proprietary formats and programs. Often, these formats lack modern mechanisms for defining internal structures and have to be considered as a single component (a "blob"). One would not expect to easily identify smaller well defined components within the document such as molecular sub-components. Such legacy files also did not provide any definition of a standard mechanism for specifying meta-data. Frequently, it might even prove impossible to identify which version or flavour of file one was dealing with (i.e. tracking the change in use of data with time). Before turning to molecular components and meta-data, one further important MIME type should be discussed.
Alongside chemical models, the MIME type model/vrml allows a more generic modelling functionality, known as Virtual Reality Modelling Language or VRML15. This type of model serves to integrate molecular models with complex 3D data representations, molecular surfaces, animations, and most interestingly, processing functionality within the model. Such models can have so-called script nodes associated with model components based on defined algorithms. For example, two molecule components of a VRML model could be associated with a defined force field that could dynamically compute the energy of interaction of the two components during an attempt to dock one model with another. Such composite scenes also allow a much richer integration of chemical with non-chemical models, and these work particularly well at the boundaries between chemistry and other disciplines. Virtual Reality models have been recently reviewed from both a general15 and a chemical perspective.11
In one sense, the problems alluded to above in identifying the chemical content of two-dimensional images are also inherent in three-dimensional models. Whilst chemical models generated from formats such as the MDL molfile can be readily indexed and searched for, much still needs to be done in chemically indexing the more generic VRML models. The equivalent of the image ALT field or HTML document meta-data could be found in the so-called VRML Viewpoints, but as with image ALT declarations, no standards in their use are employed, and such viewpoints are not yet indexed routinely by any index and search services. Interesting progress has been made in the re-identification of chemical content from VRML models16 and progress is expected to be rapid in this area in the future.
A metadata record consists of a small set of attributes, or elements, necessary to describe the resource in question. These include basic attributes of a document such as its title, date, description, creator (author), format (MIME type), subject (keywords) and relation (of the resource to other resources, normally via a URL declaration). The meta-data is normally contained in a particular component of the document known as the header. Its implementation within any particular type of document can vary; that for HTML documents has recently been standardised,17 and proposals for implementation within images and perhaps even VRML types might be expected in the future. One example of using such a declaration to enable a resource to be evaluated for a particular need was the inclusion of the following (highly non-standard!) meta-data header in each of the articles comprising the ECTOC-3 electronic conference proceedings;18
The so-called chemical prototype attempts to define a single molecule that best represents the overall molecular content of the document. In principle, this would allow automated analysis of the document to provide an indication to a user of whether the chemical content is close to their interests, or conversely whether the document represents a "dissimilar" contribution in a scan of molecular diversity. Because agreement on meta-data types and syntax within HTML documents has only recently been formalised as the Dublin Core standard,17 the five-year evolution of the Web has assimilated few of these guidelines. Almost no HTML documents contain any significant meta-data declarations, and of those that do, even fewer are chemically useful. As an example, the on-line version of this article contains some Dublin Core declarations. A project to define a discrete set of standard chemical meta-data declarations such as coordinates, substance, computation/simulation, biological activity, safety, synthesis, characterisation, instrumentation, physicochemical data, reaction data and crystallography (provisionally christened Dublin Chem) is under way.19
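As a rough sketch of what Dublin Core declarations in an HTML header can look like, the following Python fragment renders a metadata record as META elements. The "DC." name prefix follows the convention commonly used for embedding Dublin Core in HTML; the record values and the function name dublin_core_meta are hypothetical, and are not the declarations carried by the on-line version of this article.

```python
from html import escape

# Hypothetical record: the element names follow the attributes listed in the
# text; the values are invented purely for illustration.
record = {
    "title": "An example chemical document",
    "creator": "A. N. Author",
    "date": "1998-01-01",
    "format": "text/html",
    "subject": "sildenafil; molecular models",
}


def dublin_core_meta(record):
    """Render a metadata record as HTML META elements with a 'DC.' name prefix."""
    lines = []
    for element, value in record.items():
        # Escape the value so that quotes or angle brackets cannot break the tag.
        content = escape(value, quote=True)
        lines.append(f'<meta name="DC.{element}" content="{content}">')
    return "\n".join(lines)


print(dublin_core_meta(record))
```

A chemical extension along the lines of the proposed Dublin Chem would presumably add further element names to such a record, but no such names are assumed here.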
The chemical document shown in Figure/Model 1 above illustrated how discrete molecular components could be identified within a larger molecular model. This is a very specific example of a general problem in chemistry as a whole. The underlying documents used to create Model 1 utilised a combination of HTML to define the text and links between objects, MDL molfiles to carry small molecule connectivity and 3D coordinates, CSML (chemical structure markup language) to define molecular fragments and Brookhaven PDB files to carry macromolecular information. In the summer of 1995, a project was started20 to define a chemical equivalent of HTML that would provide a single self-consistent syntactic framework to replace these ad hoc methods with modular components. The latest version of CML (chemical markup language) follows a set of guidelines known as XML (extensible markup language)21 and is described in detail in another article in this issue.22
Another approach to creating chemical documents involves generating them dynamically from closed databases in response to a suitably constituted request or search query. For example, the ChemFinder site3 contains a large amount of chemical information, but this is presented to the user only in the form of a "just-in-time" document, and hence this content is not reflected in the statistics quoted above. It may well be that the majority of useful Internet-based chemical information will become available only via such controlled molecule or journal databases, created by a large number of authors, but controlled by a small number of publishers, a situation which of course reflects the current one in printed publishing. The Internet does offer an alternative paradigm, in promoting the use of chemical models and other high-value and modular chemical data in an open manner, and one where entirely new "added value" models of chemical resource discovery and "Knowledge Capture for Compounds" could be created using information management techniques. If this second scenario is to come about, then the creators of this information will have to make it happen. If we adopt the same models that led to the creation of an over-abundance of paper-based information graveyards, then it will probably not happen. The future is in our hands.