
A Molecular Hyperglossary: Organic Molecular information in Hypermedia Form

Chris Leach[a], Peter Murray-Rust[b] and Henry S. Rzepa[a]

[a]Department of Chemistry, Imperial College of Science, Technology and Medicine, London, UK and [b] Glaxo Research & Development, Stevenage, Herts, UK.

Contents: Introduction, Implementation, Features, Conclusions, Acknowledgements.

Introduction.

The dissemination of molecular information through media such as scholarly journals or conference proceedings has hitherto largely relied on the technology of the printed page. In the specific area of organic chemistry, which can be dominated by the need to convey accurately two- and three-dimensional information about molecular structure, connectivity and stereochemistry, the printed page has posed particular challenges. A host of conventions has to be assimilated to convey, in an error-free way, the stereochemistry of, say, a natural product. Indexing of such content has largely been based on a conversion of the molecular structure to text-based nomenclature, sometimes based on formal IUPAC conventions, more often on trivial naming. Thus finding structural or stereochemical information on the printed pages of a journal can be a very hit-and-miss procedure. In conferences, there is rarely the opportunity to index the abstracts, papers and posters in such a way that visitors can benefit from structured searches of the conference whilst they are actually attending it. Furthermore, during a conference presentation using, say, 35 mm slides, structural information is often gained subliminally or in broad concept rather than in specific detail. Certainly few chemists would be entirely happy basing laboratory work on a half-remembered structure copied down from a conference slide they may have seen for only a few seconds.

The possibility of offering both conferences and journals via the mechanism of the World-Wide Web allows the limitations referred to above to be overcome in a radically new way. The World-Wide Web (or simply the "Web") is a mechanism that enables information from a diverse collection of sources to be viewed easily. Using a further protocol proposed by us and known as chemical-mime, it is possible to transfer organic chemical information from an information server to the reader in a manner which allows the reader to be more involved with the information content. It is also possible to reverse the direction of flow, and have the reader contribute new molecular information, creating two-way communication between the reader and the information source, and hence in effect between different readers all using a common information source. This is of course particularly important when considering the "conference" as a medium, although to have this happen within a "journal" environment raises other interesting issues of peer-review, quality control etc.

To illustrate these various themes, we have created what we have termed a "molecular hyperglossary" in electronic form, associated with the ECTOC conference of which this paper is a component.

Another example of a working hyperglossary on the Web is the one associated with the Internet Course on The Principles of Protein Structure. This glossary is a collection of definitions of terms and molecules that people, whether tutors or students, have added to the course. These definitions can be revised by anyone, with the hope of creating a wealth of relevant auxiliary information. There is a facility that allows links to be added between those definitions where people believe there exists a common theme.

Implementation

This hyperglossary allows the reader to contribute 2D or 3D structural information about organic molecules, together with comments, via a so-called electronic "form", and have this information accepted by the remote server after appropriate self-consistency checks have been performed. By this, we mean that information that cannot be interpreted as a molecule, or fails simple formula or valency checks, will be rejected. The sophistication of these checks is of course under the control of the designer of the hyperglossary. Once this stage has been successfully passed, the information can then be made instantly available to the rest of the world. Some of the technical features of this hyperglossary are illustrated below.
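
As an illustration of the kind of simple valency check mentioned above, the following is a minimal sketch in Python. It is not the check used by the hyperglossary itself, whose scripts were written in Tcl. The table of maximum valences, the function name `check_valences` and the atom/bond representation are all assumptions made for this example; a real check would also need to handle charges, aromatic bonds and hypervalent atoms.

```python
# Illustrative maximum valences for a few neutral organic elements.
# This table is an assumption of the sketch, not part of the hyperglossary.
MAX_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2, "F": 1, "Cl": 1, "Br": 1, "I": 1}


def check_valences(atoms, bonds):
    """Return a list of problems found; an empty list means the check passed.

    atoms: list of element symbols, e.g. ["C", "O", "H"]
    bonds: list of (i, j, order) tuples using 0-based atom indices
    """
    problems = []
    used = [0] * len(atoms)  # sum of bond orders at each atom

    for i, j, order in bonds:
        valid = 0 <= i < len(atoms) and 0 <= j < len(atoms) and i != j
        if not valid:
            problems.append(f"bond ({i}, {j}) refers to an invalid atom")
            continue
        used[i] += order
        used[j] += order

    for index, symbol in enumerate(atoms):
        limit = MAX_VALENCE.get(symbol)
        if limit is None:
            problems.append(f"atom {index}: unknown element {symbol!r}")
        elif used[index] > limit:
            problems.append(
                f"atom {index} ({symbol}): valence {used[index]} exceeds {limit}"
            )
    return problems
```

A submission would be rejected whenever the returned list is non-empty, and the messages could be shown to the contributor alongside the form.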

To contribute a molecular structure to the hyperglossary, the author will need to obtain a text-based representation of this structure using a suitable program. For example, 3D coordinates can be obtained from molecular modelling or crystallographic sources. 2D coordinates can be obtained from structure drawing programs, or by defining a so-called SMILES string either manually or again using a suitable program.
The molecular information is then "pasted" into the appropriate area of a World-Wide Web FORM.
A script resident on the server indicated by the form takes this information and using a variety of programs, can convert the structure into 2D and 3D graphical representations, and various other fields, including simple ones such as molecular formulae, and more complex ones such as SMILES strings. In our implementation, we used the Tcl scripting language to define these tasks.
Clearly, many other properties could be included in the hyperglossary, including synthetic information (optionally hyperlinked to other entries in the hyperglossary), precursors, pharmacological data, toxicology information, literature references to electronic journals and so forth.
Initially, coordinate data is verified by running it through an interchange program called babel [1]. The program babel also converts the molecular data from MDL MOLFILE, SHELX files and Cambridge database FDAT files to PDB files. PDB files were chosen because of the readily available viewers, such as Rasmol, that can display the molecule. Once the information is verified, the scripts process the information to create the 2D and 3D pictures using prado [2] and Raster3D respectively. Other information, such as the molecular formula, molecular weight and SMILES, is created automatically. The auxiliary files, for example the pictures and the PDB files, are given a unique file name part generated from the time at which the user submitted the molecule. This prevents different submissions from overwriting each other. Daylight software is used extensively throughout the hyperglossary.
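
The derived fields described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the `mol` file-name prefix and the small table of atomic weights are invented for the example, and the hyperglossary itself relied on its Tcl scripts and the Daylight software for these steps.

```python
from collections import Counter
from datetime import datetime

# Standard atomic weights for a few elements only (illustrative subset).
ATOMIC_WEIGHT = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}


def hill_formula(symbols):
    """Molecular formula in Hill order: C, then H, then the rest alphabetically."""
    counts = Counter(symbols)
    if "C" in counts:
        rest = sorted(s for s in counts if s not in ("C", "H"))
        order = ["C"] + (["H"] if "H" in counts else []) + rest
    else:
        order = sorted(counts)
    return "".join(s + (str(counts[s]) if counts[s] > 1 else "") for s in order)


def molecular_weight(symbols):
    """Sum of atomic weights; raises KeyError for elements not in the table."""
    return sum(ATOMIC_WEIGHT[s] for s in symbols)


def file_stem(submitted_at):
    """File-name part derived from the submission time (hypothetical format)."""
    return submitted_at.strftime("mol%Y%m%d%H%M%S")


ethanol = ["C", "C", "O"] + ["H"] * 6
print(hill_formula(ethanol), round(molecular_weight(ethanol), 3))
print(file_stem(datetime(1995, 8, 4, 12, 30, 5)))
```

A stem with one-second resolution, as assumed here, would collide if two molecules arrived within the same second, so a real implementation would probably add a counter or process identifier.
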
The output from these programs must then be added to a set of on-line documents written in HTML format and which constitute the hyperglossary content itself. Output fields such as formula, SMILES string, comment etc, are easily marked up in the HTML language. It is also helpful to prepare small "thumbnail" 2D images of the molecule to give an indication of its nature. These are prepared as GIF images.
Where 3D molecular coordinates have either been contributed, or can be generated from the information given by the user, these can be saved in PDB format, and by the use of chemical MIME standards associated with a 2D "thumbnail" and delivered to the reader in the form of a "hyperactive molecule".
The original contribution can have a so-called Uniform Resource Locator (URL) associated with it. In effect, this means that a component of the information originally submitted can in fact reside on the originator's own server, and what is submitted is simply a pointer to this information. This enables more comprehensive information associated with the molecule to be made available, and also permits updates to this information to be made at the originator's site without impinging on the content of the hyperglossary proper. Any URLs associated with the molecule are checked by actually probing the web server containing the supposed document, using libwww-perl.
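
The hyperglossary performed this probe with libwww-perl. For readers more familiar with Python, an equivalent check might look like the sketch below, which swaps in the standard `urllib` modules. The function name `probe_url`, the ten-second timeout and the decision to accept any status below 400 are assumptions of the example.

```python
import http.client
import urllib.error
import urllib.request
from urllib.parse import urlparse


def probe_url(url, timeout=10.0):
    """Return True if an HTTP(S) HEAD request for `url` succeeds."""
    try:
        parts = urlparse(url)
        # Only probe web URLs that name a host; anything else is rejected.
        if parts.scheme not in ("http", "https") or not parts.netloc:
            return False
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, http.client.HTTPException, ValueError, OSError):
        # Covers unreachable hosts, HTTP error statuses, timeouts and bad URLs.
        return False
```

A HEAD request asks only for the headers, so the probe does not download the document itself. Some servers do not answer HEAD requests correctly, in which case a fallback to GET may be needed.
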
Because requests to the database are created on the fly, up-to-date information is always available to readers of the hyperglossary.
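
As a rough illustration of how fields such as the formula, SMILES string and comment could be marked up in HTML when an entry is requested, consider the following Python sketch. The function name, the choice of a definition list and the field labels are assumptions made for this example and do not reproduce the actual layout of the hyperglossary pages.

```python
from html import escape


def render_entry(name, formula, smiles, comment, thumbnail=None):
    """Return an HTML fragment for one hyperglossary entry.

    `thumbnail` is an optional URL of a small 2D GIF image of the molecule.
    """
    fields = [
        ("Name", name),
        ("Formula", formula),
        ("SMILES", smiles),
        ("Comment", comment),
    ]
    lines = []
    if thumbnail:
        lines.append(f'<img src="{escape(thumbnail)}" alt="{escape(name)}">')
    lines.append("<dl>")
    for label, value in fields:
        # Escape user-supplied text so it cannot break the page markup.
        lines.append(f"  <dt>{escape(label)}</dt><dd>{escape(value)}</dd>")
    lines.append("</dl>")
    return "\n".join(lines)


print(render_entry("ethanol", "C2H6O", "CCO", "a simple alcohol", "ethanol.gif"))
```

Escaping matters here because comments are contributed by readers and may contain characters such as `<` or `&`.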

The actual FORM used to input data is shown below;

Features of the Hyperglossary

Extensive use is made of SMILES strings to contain stereochemical information, created from 2D ISIS Draw and 3D molecular file formats. The SMILES format will also allow functional group searches on the database to be performed. Only limited functional group searching is possible owing to the nature of SMILES, but implementation of SMARTS, available from the Daylight suite of tools, would allow more complex substructure searches.
Stereochemistry is calculated from the position of the atoms in space. This is effective when 3D coordinates are available. If only 2D coordinates are available, for example in 2D MDL molfiles, an attempt is made to calculate the stereochemistry from the bond type (wedge or dash) contained in the 2D file. Project CORINA represents an alternative and reliable way of achieving this goal.
Unique SMILES can be created to uniquely identify the molecule, so that duplication of molecules can be avoided. An automatic check that the molecule being entered is unique is always made.
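
The duplicate check can be pictured as a lookup keyed on the unique SMILES, as in the Python sketch below. Generating a unique SMILES is the job of a cheminformatics toolkit (the Daylight software in this work), so the sketch takes that step as a caller-supplied function. The class name `MoleculeRegistry` and the `canonicalize` parameter are hypothetical.

```python
class MoleculeRegistry:
    """Stores one record per molecule, keyed on its unique SMILES."""

    def __init__(self, canonicalize):
        # `canonicalize` maps any valid SMILES to the unique SMILES of the
        # same molecule; it must be provided by an external toolkit.
        self._canonicalize = canonicalize
        self._entries = {}

    def add(self, smiles, record):
        """Add a record; return False if the molecule is already present."""
        key = self._canonicalize(smiles)
        if key in self._entries:
            return False
        self._entries[key] = record
        return True

    def __len__(self):
        return len(self._entries)
```

With such a registry, "OCC" and "C(O)C" would be recognised as the same molecule as "CCO", provided the canonicalisation function maps all three to one string.
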
If references to, say, ECTOC conference data in the form of paper numbers are given, then a hyperlink to the paper is automatically added, with the information contained in the title tags in the header of the paper. In this sense, the hyperglossary is more than just a simple flat database: it can contain extensive cross-referencing, thus serving to add an element of structure to the molecular entities associated with, say, a conference such as ECTOC.
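
One way such automatic linking could work is sketched below in Python. The pattern that recognises a reference ("paper" followed by a number), the URL template `paper{number}.html` and the `titles` lookup table are all hypothetical. In the hyperglossary the title would be read from the title tags of the paper itself.

```python
import re
from html import escape

# Matches references such as "paper 48" (an assumed convention).
PAPER_REF = re.compile(r"\bpaper\s+(\d+)\b", re.IGNORECASE)


def link_papers(text, titles, url_template="paper{number}.html"):
    """Wrap known paper references in `text` with hyperlinks.

    `text` is assumed to be HTML-escaped already. `titles` maps paper
    numbers to titles; references to unknown papers are left unchanged.
    """

    def replace(match):
        number = int(match.group(1))
        title = titles.get(number)
        if title is None:
            return match.group(0)
        href = url_template.format(number=number)
        return f'<a href="{escape(href)}" title="{escape(title)}">{match.group(0)}</a>'

    return PAPER_REF.sub(replace, text)


print(link_papers("See paper 48 for details.", {48: "A Molecular Hyperglossary"}))
```
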
The contents of the hyperglossary can be viewed thus;

It is important to realise that this display is generated directly from the database, and does not necessarily have to be maintained by e.g. a conference editor. Presently it is possible to search for specific text strings in the hyperglossary, including fields such as formula, name or SMILES descriptor.
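
A text-string search of this kind can be sketched in Python as follows. The record layout (dictionaries with `name`, `formula` and `smiles` keys) and the function name are assumptions made for the example. Names are compared without regard to case, whereas formulae and SMILES are compared exactly because case carries meaning there (for instance `c` for an aromatic carbon in SMILES).

```python
def search(entries, query, fields=("name", "formula", "smiles")):
    """Return the entries in which `query` occurs in any of the given fields."""
    hits = []
    for entry in entries:
        for field in fields:
            value = entry.get(field, "")
            if field == "name":
                found = query.lower() in value.lower()
            else:
                found = query in value  # case-sensitive for formula and SMILES
            if found:
                hits.append(entry)
                break  # report each entry once
    return hits


entries = [
    {"name": "ethanol", "formula": "C2H6O", "smiles": "CCO"},
    {"name": "benzene", "formula": "C6H6", "smiles": "c1ccccc1"},
    {"name": "cyclohexane", "formula": "C6H12", "smiles": "C1CCCCC1"},
]
print([e["name"] for e in search(entries, "c1cc")])
```

Plain substring matching on SMILES finds only molecules whose stored string happens to contain the query as written; an equivalent spelling such as "OCC" for ethanol would be missed, which is the limitation that SMARTS-based substructure searching is intended to remove.
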
If a search is successful, the results are presented in a table form;

Conclusions.

In introducing the concept of an on-line molecular hyperglossary, it was our intention to create a collaborative environment where bibliographic and 2D/3D molecular data could be exchanged between the participants of a conference. Some degree of quality control can be automated, and such a database can serve to impart a degree of uniformity of presentation and structure to such an event. The scope of such a hyperglossary could range from local use by a research group or department, through use as a component of an electronic conference (as here), to use as an adjunct to an electronic journal. The entire collection of information is readily indexed and, via e.g. the SMILES string, is even searchable in a sub-structure context.

Future developments of the hyperglossary will include the introduction of TGF reaction files to allow organic chemists to draw a complete reaction scheme and submit it to the hyperglossary. The hyperglossary will store the reaction scheme and create a 2D picture of the scheme automatically. The structures within the reaction could be separated and used to create 3D pictures and their corresponding 3D molecular information for viewing in 3D viewers. In due course, we anticipate that commercial implementations of these concepts will become available.

Acknowledgements

We are grateful to Pat Walters, author of Babel, for helpful discussions, and to Daylight Chemical Information Systems for their generosity in allowing us access to their program libraries.

Revised 4/08/95