Use of Meta-data in Chemical Content Relevancy Ranking

Georgios V. Gkoutos and Henry S. Rzepa

Department of Chemistry, Imperial College of Science, Technology and Medicine, London, SW7 2AY.

Introduction

The increasingly widespread use from 1993 onwards of HTML (HyperText Markup Language) as a structured and globally common language for constructing documents for deployment on the Internet marked the start of a revolution in indexing and retrieval of information on a truly global scale.

HTML is an open human-readable format based on 7-bit ASCII text, unlike 8-bit closed binary mode document formats readable only by using proprietary programs. Because the internal structure of a document written in HTML is defined using well-formed markup instructions, building an index from a diverse collection of such documents, which might have been created by many different authors using a variety of styles, remains a relatively simple technical operation. The first such index, known as Lycos, was created in 1994 and used as its source around 30 million HTML-based documents, retrieving these documents from the Internet using a so-called "spider" or "robot". Much imitated subsequently, such global indices now form perhaps the most important mechanism for classifying, ordering and retrieving information from Internet-based document and data collections.

By design a very simple and specifically a general markup language, HTML allows for only very basic internal document structures to be defined. It has little syntax for defining subject specific "information components" and serves specific subject areas such as chemistry particularly poorly. The limitations of HTML were recognised early on, and a number of initiatives to increase the quality of text-based indexing of such documents were started. One of these, known as the Dublin Core (DC) project, attempted to define a small and limited set of so-called metadata elements to be defined with a document, and which an index engine could use to achieve a more valuable ranking of information and identification of data present within the document.¹

The term metadata can be a somewhat elusive concept to fully grasp; at its simplest it is often described as "data about data". An analogy serves better to illustrate its meaning. A restaurant exists to serve food to its customers, but it first has to entice these customers into the restaurant. Most restaurants do so by creating a menu of dishes available to potential diners, and placing this in a prominent position outside the restaurant. The menu serves as metadata describing the food, indicating the price of the meal, and perhaps including some information about the origins of the dish, the chef that might cook it, the "hotness" of oriental dishes, and so forth. Such information is normally sufficient to allow prospective diners to decide whether to enter the restaurant, but of course in no way can it be described as a replacement for the full gastronomic experience. DC is an attempt to create the equivalent of a restaurant menu for HTML document collections. We emphasize that a schema such as the Dublin Core should NOT be considered as a complete solution for compensating for the limitations of HTML itself, but rather as a mechanism for enhancing the usefulness of such documents. DC is of course not limited just to HTML collections, but can be expressed in many other data and document types.

In this short note, we will describe the more useful DC metadata elements, and how they can be used to create an enhanced-value index for the ECSOC-1 and ECSOC-2 HTML document collections. The origins of DC as a bibliographic descriptor devised primarily by librarians for their own use also mean that its direct usefulness for enhancing the "chemical" menu of an HTML document is quite limited. We address this by introducing a further proposed schema based on what we term chemical meta-elements, and show how on the basis of two articles published in Molecules a more accessible chemical index could be constructed. Finally, we note that to achieve a more finely grained chemical resource, the answer lies not in devising more complex metadata schemes but in employing a more finely grained and specific markup based on the new XML, or eXtensible Markup Language. One chemical implementation of XML known as CML (Chemical Markup Language) has recently been described.²

The Dublin Core Elements

Although the full DC set includes 15 elements, we have chosen to define only 8 of these (Table 1) within the ECSOC document collection. The DC elements are formally entered into the so-called document header, a section that is followed by the document body, where the principal content is located. Although it is possible to enter the elements and their values using simple text editing tools, several Internet-based services exist which can read an HTML document, identify any existing metadata it contains, and allow the author to add missing elements or edit the content. We made use of the DC.dot generator for this.¹ This involves invoking the DC.dot tool with the URL http://www.ukoln.ac.uk/dcdot/ and then entering into the form presented the URL of the document being edited. Most of the values for the fields were already present in the majority of articles presented at the ECSOC conferences, although clearly some decisions on behalf of authors did have to be made. It would of course be much better if these values were routinely supplied by authors themselves when preparing the articles. Such procedures were in fact followed in the submission procedures for the ECHET98 electronic conference held in 1998.³

Table 1. Selected Elements of the Dublin Core Schema
Element Name	Description of the element	Deployment in HTML 4.0
HEAD	Specifies the location of a meta data profile.	<HEAD profile="http://purl.org/metadata/dublin_core">
DC.title	Title of Document	<META NAME="DC.title" CONTENT="">
DC.creator	Author(s) of article	<META NAME="DC.creator" CONTENT="">
DC.subject	Keywords describing article	<META NAME="DC.subject" CONTENT="">
DC.description	Abstract of article	<META NAME="DC.description" CONTENT="">
DC.date	Date of article (submission date)	<META NAME="DC.date" CONTENT="1999-03-25">
DC.publisher	Address of authors	<META NAME="DC.publisher" CONTENT="">
DC.type	Type of Document.	<META NAME="DC.type" CONTENT="chemical">
DC.coverage	Field of coverage	<META NAME="DC.coverage" CONTENT="organic synthesis">
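
As an illustration of how an index engine might read the declarations in Table 1, the following minimal Python sketch collects the DC elements and the profile from a document header. It uses only the standard-library html.parser module; the class name, the function name and the sample document (including its title and author values) are hypothetical and are not taken from DC.dot, htdig or Quest.

```python
from html.parser import HTMLParser


class DublinCoreExtractor(HTMLParser):
    """Collect <META NAME="DC.*" CONTENT="..."> pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.profile = None   # value of <HEAD profile="...">, if present
        self.elements = {}    # e.g. {"DC.title": ["..."]}

    def handle_starttag(self, tag, attrs):
        # html.parser lower-cases tag and attribute names; values are kept as written.
        attributes = dict(attrs)
        if tag == "head":
            self.profile = attributes.get("profile")
        elif tag == "meta":
            name = attributes.get("name") or ""
            if name.lower().startswith("dc."):
                content = attributes.get("content") or ""
                self.elements.setdefault(name, []).append(content)


def extract_dublin_core(html_text):
    """Return (profile, elements) for the given HTML text."""
    parser = DublinCoreExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.profile, parser.elements


# Made-up sample document for illustration only.
SAMPLE = """<HTML><HEAD profile="http://purl.org/metadata/dublin_core">
<META NAME="DC.title" CONTENT="An example article">
<META NAME="DC.creator" CONTENT="A. N. Author">
<META NAME="DC.date" CONTENT="1999-03-25">
<META NAME="DC.type" CONTENT="chemical">
</HEAD><BODY><P>Body text.</P></BODY></HTML>"""

profile, elements = extract_dublin_core(SAMPLE)
for name, values in elements.items():
    print(name, values)
```

Each element is stored as a list because a document may declare the same element more than once, for example one DC.creator entry per author.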

Chemical Extension to the Dublin Core Elements.

A possible schema for chemical extension is shown in Table 2. Note that the designation DC.chem is used to indicate a hierarchical schema following the principles of the Dublin Core. The abbreviation chem is used to avoid semantic controversy over the difference between "chemistry" and "chemical" as the content descriptor. We believe in fact that chemical is a better descriptor of the content of the document; chemistry by way of distinction being the field of coverage rather than the content. To unambiguously identify the schema in use, the recommendation (in HTML 4) is that a metadata profile is specified in the document header linking to a declaration of the schema.

By including elements of this type, in effect the author of the document (a) declares that some form of chemical data will be associated with the document or its children, and (b) identifies the nature of that content to allow subsequent specific parsing of the data by additional software agents if desired. It is not in general intended that finely-grained chemical data be included within the content of the DC.chem declarations, although some types of chemical metadata such as the molecular formula, the molecular weight or a simple atom connection table such as the SMILES string may be valuable. The purpose of such entries is to assist in more extensive analysis of the document should this be desired, and to help in identifying the resources that will be needed to perform this analysis. This specific task can be implemented via the declaration of a SCHEME attribute. For example, the metaelement for molecular coordinates listed in Table 2, <META NAME="DC.chem.coordinates" SCHEME="pdb" CONTENT="chemical/x-pdb">, could be used to automatically invoke an external parser from the search engine capable of analysing the meta-content of the particular file, the content of which is specified by the scheme "PDB" and the type by the MIME identifier chemical/x-pdb. The PDB format for example has meta identifiers such as TITLE, AUTHOR, KEYWORDS etc. which can be usefully indexed. A full parsing of the complete content is left to specialised chemical search engines. Such an external parser has in fact been used (albeit by manual and not automatic configuration) for the document collection indexed in this example.⁴ If necessary, an associated LINK element can be included in the metadata to enable any agent, automatic or otherwise, to acquire a formal definition of the metadata type, and if necessary the appropriate software resources to analyse the content of the document (for example a remote applet to be used as an external parser).
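
The dispatch on scheme and MIME type described above might be sketched as follows. This is a simplified Python illustration, not the external parser actually used with htdig or Quest: the registry, the function names and the sample file content are hypothetical, and the header parser assumes the fixed-column PDB layout in which the record name occupies the first six columns and the text starts at column 11 (the PDB record holding keywords is spelled KEYWDS).

```python
def parse_pdb_header(text):
    """Return indexable header fields from PDB-format text (simplified sketch)."""
    wanted = ("TITLE", "AUTHOR", "KEYWDS")
    fields = {}
    for line in text.splitlines():
        record = line[:6].strip()
        if record in wanted:
            # Text starts at column 11; continuation lines are joined with a space.
            joined = fields.get(record, "") + " " + line[10:].strip()
            fields[record] = joined.strip()
    return fields


# Hypothetical registry mapping (scheme, MIME type) to an external parser.
PARSERS = {("pdb", "chemical/x-pdb"): parse_pdb_header}


def select_parser(scheme, content):
    """Pick a parser from the SCHEME and CONTENT of a DC.chem metaelement."""
    return PARSERS.get((scheme.lower(), content.lower()))


# Made-up file content for illustration only.
SAMPLE_PDB = (
    "TITLE     EXAMPLE STRUCTURE\n"
    "AUTHOR    A.N.AUTHOR\n"
    "KEYWDS    EXAMPLE, SKETCH\n"
    "ATOM      1  N   ALA A   1      11.104   6.134  -6.504  1.00  0.00           N\n"
)

parser = select_parser("PDB", "chemical/x-pdb")
header = parser(SAMPLE_PDB) if parser else {}
print(header)
```

Only the header records are extracted here, in keeping with the intention that a full parsing of the coordinates is left to specialised chemical search engines.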

We also note that whilst a declaration that uses the SCHEME attribute is preferable to an alternative that does not rely on it, the latter is far more likely to be compatible with any existing "structured field" index engine (such as htdig or JObjects, used in these examples). The former would require specific modifications to these tools to recognise the scheme attribute. For this reason, we have used the latter format, whilst recognising that should suitable support for scheme declarations become available, the latter can be easily superseded.

Table 2. A Chemical Metadata Schema

Element Name	Description of the element	Deployment in HTML 4.0
HEAD	Specifies the location of a meta data profile.	<HEAD profile="http://www.ch.ic.ac.uk/profiles/chemical/">
DC.chem.coordinates	Molecular coordinates	<META NAME="DC.chem.coordinates" SCHEME="pdb" CONTENT="chemical/x-pdb">
DC.chem.substance.formula	Formula constitution	<META NAME="DC.chem.substance.formula" SCHEME="formula" CONTENT="C22H30N6O4S">
DC.chem.substance.smiles	Connection table for molecule	<META NAME="DC.chem.substance.smiles" SCHEME="smiles" CONTENT="">
		<link rel="DC.chem.substance" type="text/html" HREF="http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html">
DC.chem.computation-simulation	Presence of computed or simulated property	<META NAME="DC.chem.computation-simulation" SCHEME="MOPAC" CONTENT="PM3">
DC.chem.biological-activity	Biological activity	<META NAME="DC.chem.biological-activity" SCHEME="..." CONTENT="1-">
DC.chem.safety	Type of chemical safety information	<META NAME="DC.chem.safety" SCHEME="..." CONTENT="">
DC.chem.characterisation	Characterisation mode of molecule	<META NAME="DC.chem.characterisation" CONTENT="MP, HPLC, IR, 1H NMR">
DC.chem.instrumentation	Associated instrumentation	<META NAME="DC.chem.instrumentation" CONTENT="">
DC.chem.physicochemical-data	Molecular properties	<META NAME="DC.chem.physicochemical-data" CONTENT="">
DC.chem.reaction-data	Reaction classification	<META NAME="DC.chem.reaction-data" SCHEME="GRINS" CONTENT="">
		<link rel="DC.chem.reaction-data" type="text/html" HREF="http://www.daylight.com/dayhtml/doc/theory/theory.grins.html">
DC.chem.crystallography	Crystallographic information	<META NAME="DC.chem.crystallography" SCHEME="BCA" CONTENT="">

Indexing and Searching the Document Collection

Once the metadata elements have been defined within the HTML documents, the task of creating a structured field index of the collection can be easily automated and scaled up to arbitrarily large document collections.

Indexing Using htdig

To achieve this, we used two index tools which recognise defined metadata schemas. The first, termed htdig,⁴ allows for indexing of a remote document server, and is particularly useful for creating a single index from a document collection residing on multiple servers. Version 3.1.2 of htdig supports the definition of a metadata schema and the assignment of weighting factors to individual elements in that schema, and also includes the ability to assign external parsers capable of indexing e.g. files with explicit chemical content such as coordinates or spectra. It does not however support searches based on specific elements of that schema (for example, a search for only authors), nor does it support the creation of specific templates for the display of results which make use of metadata elements (e.g. the automatic display of any address associated with an author). Another feature of htdig is that it is a Unix-based system which is only supported in a server-client configuration, i.e. to perform a search one must connect to the remote server where the index database resides. Such a system is not really suitable for deployment on a CD-ROM.

Indexing Using Quest Agent

To overcome these difficulties, we also implemented a Java-based tool known as Quest.⁵ The Quest agent component of this tool performs the indexing of the document collection, and this is easily configured to accept any pre-defined schema such as DC, together with chemical extensions to the DC schema. As with htdig, each DC element of the schema can be assigned a weighting factor (normally selected as 10) normalised to a value of 1 for any term located only in the body of the document. Figure 1 shows the Quest configuration entries for the standard DC schema and the Chemical extensions specifically implemented for two articles⁸,⁹ to illustrate the concept.
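
The effect of such weighting factors on ranking can be illustrated with a small Python sketch. It assumes a simple additive score in which each occurrence of the search term contributes the weight of the place where it is found (10 for a metadata element by default, 1 for the body); this scoring rule, the function names and the two sample documents are assumptions made for illustration, and the actual ranking formulae of htdig and Quest may differ.

```python
import re

METADATA_WEIGHT = 10  # weighting factor normally selected for a metadata element
BODY_WEIGHT = 1       # weight of a term located only in the body of the document


def tokenize(text):
    """Split text into lower-case alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(term, metadata, body, weights=None):
    """Sum the weighted occurrences of term in the metadata and the body.

    metadata maps element names (e.g. "DC.subject") to their content;
    weights optionally overrides the default weight for named elements.
    """
    weights = weights or {}
    term = term.lower()
    total = BODY_WEIGHT * tokenize(body).count(term)
    for element, content in metadata.items():
        weight = weights.get(element, METADATA_WEIGHT)
        total += weight * tokenize(content).count(term)
    return total


# Made-up documents: (metadata, body text).
docs = {
    "article-a.html": ({"DC.subject": "organic synthesis"},
                       "A new synthesis is reported."),
    "article-b.html": ({"DC.subject": "spectroscopy"},
                       "Synthesis, synthesis and more synthesis."),
}

ranked = sorted(docs, key=lambda name: score("synthesis", *docs[name]), reverse=True)
print(ranked)
```

In this sketch the document that declares the term as a keyword outranks one that merely repeats it in the body, which is the behaviour the weighting is intended to produce.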

Figure 1. Chemical Extensions and the Dublin Core Configuration Settings (panels: Dublin Core; Chemical Extensions)

Searching Using the Quest Client

Once the indexing operation is complete, searches can be conducted by deploying specific Quest Java applets within the appropriate section of the HTML document collection. The Quest client allows a variety of user-interfaces to be used.

The so-called fielded search applet allows each DC field to be searched for individually, or in combination with other fields using boolean logic. This allows for example articles by individual authors to be located, articles coming from specified institutions, or published on a specified date.
Figure 2. Deploy a Field Search Applet for Dublin Core fields.
For chemical metadata declarations, this format allows searches of specific areas of chemistry such as molecular coordinates, reaction data, etc.
Figure 3. Deploy a Field Search Applet for Chemical fields.
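
The fielded search described above, in which individual metadata fields are combined using boolean logic, can be sketched in Python as follows. The record layout, the function names and the sample records are hypothetical and do not reproduce the Quest applet's interface; matching is a plain case-insensitive substring test.

```python
def matches(record, field, term):
    """True if term occurs (case-insensitively) in the named field of record."""
    return term.lower() in record.get(field, "").lower()


def fielded_search(records, all_of=(), any_of=(), none_of=()):
    """Return records satisfying AND (all_of), OR (any_of) and NOT (none_of).

    Each condition is a (field, term) pair, e.g. ("DC.creator", "Rzepa").
    """
    results = []
    for record in records:
        ok_all = all(matches(record, f, t) for f, t in all_of)
        ok_any = not any_of or any(matches(record, f, t) for f, t in any_of)
        ok_none = not any(matches(record, f, t) for f, t in none_of)
        if ok_all and ok_any and ok_none:
            results.append(record)
    return results


# Made-up index records for illustration only.
records = [
    {"url": "a.html", "DC.creator": "A. N. Author", "DC.date": "1998-09-01",
     "DC.chem.coordinates": "chemical/x-pdb"},
    {"url": "b.html", "DC.creator": "A. N. Author", "DC.date": "1999-03-25"},
    {"url": "c.html", "DC.creator": "B. Writer", "DC.date": "1999-03-25",
     "DC.chem.coordinates": "chemical/x-pdb"},
]

hits = fielded_search(
    records,
    all_of=[("DC.creator", "author"), ("DC.chem.coordinates", "x-pdb")],
)
print([r["url"] for r in hits])
```

Because the chemical extensions are stored as ordinary named fields, the same routine serves for both the Dublin Core fields and the DC.chem fields.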
Another valuable search type is the search-term highlighting mode illustrated in the search page for this CD-ROM. This form of search is particularly useful when accessing long documents where multiple occurrences of the search term might occur. This operation involves specifically editing the original HTML document dynamically to insert a highlighting tag such as <font color=red>search term</font> and optionally an additional navigation aid to jump to either the next or previous occurrence of the search term. This specific dynamic editing operation is achieved by using parameters passed from the Quest Java applet to Javascript controls embedded within the HTML search document.
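
The highlighting operation can be sketched as follows. The system described above performs it dynamically with Javascript driven by the Quest applet; this Python version is only a static illustration of the same idea, and the function name, the hit anchors used as a navigation aid and the sample markup are assumptions. The sketch leaves the contents of tags untouched, but does not handle scripts or comments that themselves contain a ">" character.

```python
import re


def highlight(html_text, term):
    """Wrap each occurrence of term in text (not inside tags) in a highlight tag.

    Returns the edited HTML and the number of occurrences found. Each hit is
    preceded by a named anchor (hit1, hit2, ...) that a navigation control
    could use to jump to the next or previous occurrence.
    """
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    count = 0

    def mark(match):
        nonlocal count
        count += 1
        return '<a name="hit%d"></a><font color=red>%s</font>' % (count, match.group(0))

    # Splitting on a captured group keeps the tags: even-numbered pieces are
    # text, odd-numbered pieces are tags, which are passed through unchanged.
    pieces = re.split(r"(<[^>]*>)", html_text)
    edited = [p if i % 2 else pattern.sub(mark, p) for i, p in enumerate(pieces)]
    return "".join(edited), count


# Made-up fragment: the attribute value must not be highlighted.
SAMPLE = '<p title="metadata">Metadata and more metadata.</p>'
result, hits = highlight(SAMPLE, "metadata")
print(hits)
print(result)
```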

Applications of Meta-Data to Concept Maps.

The traditional way of displaying the results of an index search is a linear presentation, sorted either alphabetically, or more usefully by a rank determined by word frequency, by weighting factor (e.g. terms found in metadata are given a higher weighting) or by word proximity. Recently, some interesting new methods have been developed for displaying search results. These include Mapuccino⁶ from the IBM Research laboratories and so-called hyperbolic tree representations, developed by InXight.⁷ These take the form of concept maps of the document collection, centered around e.g. the most highly ranked document, and with links to associated documents. The association of metadata with such dynamically generated diagrams represents an interesting method for information retrieval. An example of such a concept map generated from the small collection of documents associated with this article, and two others published⁸,⁹ on the CD-ROM, is shown in Figure 4.

Figure 4. A Concept Map generated using the Mapuccino System⁶

Conclusions

Whilst of course such search and retrieval can also be achieved by conventional full-text indexing of a document, we emphasize that the advantage of declaring metadata elements within the HTML documents is that it identifies specific data components contained within the documents and allocates such components a greater weight than the body of the full text. The value of declaring such metadata in the document header is that it permits a separation of data components identified or entered by individual authors (i.e. the metadata elements) from the separate task of creating a global index of the entire document collection across multiply-authored documents. Such a separation allows automated indexing of such document collections using the most appropriate tools. Because no assumptions are made about what type of indexing tools will be used, the resultant document collection can be indexed and searched using different tools as appropriate. Thus for an Intranet-based model, a search harvester such as htdig might be appropriate, whereas for CD-ROM deployment, a local Java-based client is more suitable. In a subsequent article, we will discuss the development of additional tools which make use of document metadata (a) to create a metadata editor based on a defined schema and (b) to harvest specifically the metadata components of a remote site and to classify these according to the elements located.¹⁰

Acknowledgements.

One of us (GVG) thanks Merck, Sharp and Dohme for the award of a studentship. We are also grateful to JObjects Inc for preview releases of the Quest software.

References

1. S. Weibel, Bull. Am. Soc. Inf. Sci., 1997, 24, 9-11. A similar model for medical metadata has been proposed; G. Malet, F. Munoz, R. Appleyard, W. Hersh, J. Am. Med. Informatics Assoc., 1999, 6, 163-172. See http://www.ukoln.ac.uk/dcdot/ for details of the Dublin Core Project and other useful tools for analysing and creating DC content.
2. P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comp. Sci., 1999, submitted for publication.
3. "Electronic Conference on Heterocyclic Chemistry '98", H. S. Rzepa and O. Kappe (Eds), Imperial College Press, 1998, ISBN 981-02-3594-1. See also http://www.ch.ic.ac.uk/ectoc/echet98/.
4. See http://www.htdig.org/ for further details. An article describing a "dig" of select sites using external parsers for identifying meta content in chemical files will be published elsewhere; G. V. Gkoutos, H. S. Rzepa and A. N. Turner, to be submitted.
5. Available from http://www.jobjects.com/. External chemical parsers for this utility are available upon request from the authors (G. V. Gkoutos, H. S. Rzepa and A. N. Turner, to be submitted).
6. Mapuccino: http://www.ibm.com/java/mapuccino/.
7. See http://www.inxight.com/.
8. H. S. Rzepa, Molecules, 1998, 3, 94-99.
9. P. May, Molecules, 1998, 3, 16-19.
10. G. V. Gkoutos, H. S. Rzepa and A. N. Turner, to be submitted.