[a]Department of Chemistry, Imperial College of Science, Technology and
Medicine, London, UK and [b] Glaxo Research & Development, Stevenage, Herts, UK.
The dissemination of molecular information through media such as scholarly
journals or conference proceedings has hitherto largely relied on the
technology of the printed page for its operation. In the specific area of
organic chemistry, which can be dominated by the need to accurately convey two
and three dimensional information about molecular structure, connectivity and
stereochemistry, the printed page has posed particular challenges. A host of
conventions has to be assimilated to convey, in an error-free way, the
stereochemistry of, say, a natural product. Indexing of such content has
largely been based on converting the molecular structure to text-based
nomenclature, sometimes following formal IUPAC conventions, more often
trivial naming. Finding structural or stereochemical information on the
printed pages of a journal can therefore be a very hit-and-miss
procedure. In conferences, there is rarely the opportunity to index the
abstracts, papers and posters in such a way that visitors to the conference can
benefit from structured searches of the conference whilst they are actually
attending the conference. Furthermore, during a conference presentation using,
say, 35 mm slides, structural information is often gained subliminally or in
broad concept rather than specific detail. Certainly few chemists would be
entirely happy basing laboratory work on a half-remembered structure copied
down from a conference slide they may have seen for only a few seconds.
The possibility of offering both conferences and journals via
the mechanism of the World-Wide Web allows the limitations referred to above to
be overcome in a radically new way. The World-Wide Web (or simply the "Web") is
a mechanism that enables information from a diverse collection of sources
to be viewed easily. Using a further protocol proposed by us and known as
chemical MIME, it is possible to
transfer organic chemical information from an information server to the reader
in a manner which allows the reader to be more involved with the information
content. It is also possible to reverse the direction of flow, and have the
reader contributing new molecular information to create two-way communication
between the reader and the information source, and hence in effect between
different readers all using a common information source. This is of course
particularly important when considering the "conference" as a medium, although
to have this happen in a "journal" environment raises other interesting
issues of peer review, quality control and so on.
To illustrate these various themes, we have created what we have termed a
"molecular hyperglossary" in electronic form, associated with the ECTOC
conference of which this paper is a component.
Another example of a working hyperglossary on the Web is the one associated
with the Internet Course on The Principles of Protein
Structure. This glossary is a collection of definitions of terms and
molecules that people, whether they are a tutor or a student, have added to the
course. These definitions can be revised by anyone, hopefully to create a
wealth of relevant auxiliary information. There is a facility that allows
links to be added between those definitions where people believe there exists a
common theme.
This hyperglossary allows the reader to contribute 2D or 3D structural
information about organic molecules and comments into a so-called electronic
"form", and have this information accepted by the remote server after
appropriate self-consistency checks have been performed. By this, we mean that
information that cannot be interpreted as a molecule, or fails simple formula
or valency checks, will be rejected. The sophistication of these checks is of
course under the control of the designer of the hyperglossary. Once this stage
has been successfully passed, the information can then be made instantly available
to the rest of the world. Some of the technical features of this hyperglossary
are illustrated below.
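As an illustration of the kind of self-consistency check described above, the following is a minimal sketch in Python (not the scripts used by the hyperglossary itself). The function name `check_valences`, the small valence table and the atom/bond input layout are all assumptions made for this example; charged species and hypervalent atoms are deliberately ignored.

```python
# Maximum number of bonds allowed per neutral atom. This table is an
# assumption made for the sketch and covers only a few common elements.
MAX_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2, "F": 1, "Cl": 1, "Br": 1, "I": 1}


def check_valences(atoms, bonds):
    """Return a list of problems; an empty list means the structure passes.

    atoms: list of element symbols, e.g. ["C", "O", "H", "H"]
    bonds: list of (index_a, index_b, bond_order) tuples, indices 0-based
    """
    problems = []
    used = [0] * len(atoms)  # total bond order accumulated on each atom
    for a, b, order in bonds:
        valid = 0 <= a < len(atoms) and 0 <= b < len(atoms) and a != b
        if not valid:
            problems.append(f"bond ({a}, {b}) refers to an invalid atom")
            continue
        used[a] += order
        used[b] += order
    for index, symbol in enumerate(atoms):
        limit = MAX_VALENCE.get(symbol)
        if limit is None:
            problems.append(f"atom {index}: unknown element {symbol!r}")
        elif used[index] > limit:
            problems.append(
                f"atom {index}: {symbol} has {used[index]} bonds, limit is {limit}"
            )
    return problems
```

A submission would be rejected whenever the returned list is not empty; a stricter designer could extend the table or add a comparison against a user-supplied formula.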
To contribute a molecular structure to the hyperglossary, the author will
need to obtain a text-based representation of this structure using a suitable
program. For example, 3D coordinates can be obtained from molecular modelling
or crystallographic sources. 2D coordinates can be obtained from structure
drawing programs, or by defining a so-called SMILES string either manually or again
using a suitable program.
The molecular information is then "pasted" into the appropriate area of a
World-Wide Web FORM.
A script resident on the server indicated by the form takes this
information and, using a variety of programs, converts the structure into 2D
and 3D graphical representations, and various other fields, including simple
ones such as molecular formulae, and more complex ones such as SMILES strings.
In our implementation, we used the Tcl scripting language to define these
tasks.
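The simpler derived fields can be illustrated with a short sketch. The Python below is not the Tcl used in our implementation; it assumes the atom list has already been extracted from the submitted structure, orders the formula in the Hill convention (an assumption, since the ordering is not specified above) and uses a small table of rounded standard atomic weights.

```python
from collections import Counter

# Rounded standard atomic weights; only a few elements are listed here.
ATOMIC_WEIGHT = {
    "H": 1.008, "C": 12.011, "N": 14.007,
    "O": 15.999, "S": 32.06, "Cl": 35.45,
}


def hill_formula(atoms):
    """Molecular formula in Hill order: C, then H, then the rest alphabetically.

    If the molecule contains no carbon, all elements are listed alphabetically.
    """
    counts = Counter(atoms)
    if "C" in counts:
        rest = sorted(s for s in counts if s not in ("C", "H"))
        order = ["C"] + (["H"] if "H" in counts else []) + rest
    else:
        order = sorted(counts)
    return "".join(s + (str(counts[s]) if counts[s] > 1 else "") for s in order)


def molecular_weight(atoms):
    """Sum of atomic weights; raises KeyError for elements not in the table."""
    return sum(ATOMIC_WEIGHT[s] for s in atoms)


ethanol = ["C", "C", "O"] + ["H"] * 6
print(hill_formula(ethanol), round(molecular_weight(ethanol), 3))
```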
Clearly, many other properties could be included in the hyperglossary,
including synthetic information (optionally hyperlinked to other entries in the
hyperglossary), precursors, pharmacological data, toxicology information,
literature references to electronic journals and so forth.
Initially, coordinate data is verified by running it through an interchange
program called Babel [1]. Babel also converts the molecular
data from MDL MOLFILE, SHELX files and
Cambridge database FDAT files to PDB files. PDB files were chosen because
viewers that can display the molecule, such as RasMol, are readily
available. Once the information is verified, the scripts process it
to create the 2D and 3D pictures using prado [2] and Raster3D
respectively. Other information, such as the molecular formula, molecular
weight and SMILES, is created automatically. The auxiliary files,
for example the pictures and the PDB files, are given a unique file name part
generated from the time the user submitted the molecule. This prevents
different submissions from overwriting each other. Daylight software is used
extensively throughout the hyperglossary.
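The time-based naming of the auxiliary files can be sketched as follows. This Python fragment is illustrative only: the `mol` prefix, the file suffixes and the addition of the process id (so that two submissions arriving within the same second still receive different names) are assumptions of the sketch, not details of our scripts.

```python
import os
import time


def unique_stem(submitted_at=None, pid=None):
    """Build a file-name stem from the submission time plus the process id."""
    if submitted_at is None:
        submitted_at = time.time()
    if pid is None:
        pid = os.getpid()
    # UTC timestamp with one-second resolution, e.g. 19950804153000
    stamp = time.strftime("%Y%m%d%H%M%S", time.gmtime(submitted_at))
    return f"mol{stamp}_{pid}"


def auxiliary_files(stem):
    """Names of the files generated for one submission (hypothetical layout)."""
    return {
        "pdb": stem + ".pdb",
        "image_2d": stem + "_2d.gif",
        "image_3d": stem + "_3d.gif",
    }


print(auxiliary_files(unique_stem()))
```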
The output from these programs must then be added to a set of on-line
documents written in HTML, which constitute the hyperglossary content
itself. Output fields such as formula, SMILES string and comment are easily
marked up in HTML. It is also helpful to prepare small "thumbnail"
2D images of the molecule to give an indication of its nature. These are
prepared as GIF images.
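Marking up the output fields might look like the sketch below. The function name `entry_to_html`, the choice of a definition list and the field labels are assumptions made for illustration; the actual page layout of the hyperglossary is not reproduced here. User-supplied text is escaped so that characters such as `<` cannot break the page.

```python
from html import escape


def entry_to_html(name, formula, smiles, comment, thumbnail):
    """Render one hyperglossary entry as an HTML fragment."""
    rows = [("Formula", formula), ("SMILES", smiles), ("Comment", comment)]
    lines = [
        f"<h2>{escape(name)}</h2>",
        # Thumbnail GIF giving a quick 2D impression of the molecule
        f'<img src="{escape(thumbnail)}" alt="2D structure of {escape(name)}">',
        "<dl>",
    ]
    for label, value in rows:
        lines.append(f"  <dt>{label}</dt><dd>{escape(value)}</dd>")
    lines.append("</dl>")
    return "\n".join(lines)


print(entry_to_html("Ethanol", "C2H6O", "CCO", "bp < 80 C", "ethanol.gif"))
```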
Where 3D molecular coordinates have either been contributed, or can be
generated from the information given by the user, these can be saved in PDB
format and, by the use of chemical MIME standards,
associated with a 2D "thumbnail" and delivered to the reader in the form of a
"hyperactive molecule".
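A minimal sketch of how a server-side script might label a PDB file and wrap its thumbnail in a link is given below. It assumes the chemical MIME type name `chemical/x-pdb` for the `.pdb` suffix; the function names and the HTML produced are hypothetical and serve only to illustrate the idea.

```python
import mimetypes
from html import escape

# Register the chemical MIME type for PDB files with Python's type table.
mimetypes.add_type("chemical/x-pdb", ".pdb")


def content_type_for(filename):
    """Content type to send with a file, falling back to a generic binary type."""
    guessed, _encoding = mimetypes.guess_type(filename)
    return guessed or "application/octet-stream"


def hyperactive_link(pdb_file, thumbnail, name):
    """Wrap a 2D thumbnail in a link to the 3D coordinates."""
    return (
        f'<a href="{escape(pdb_file)}" type="{content_type_for(pdb_file)}">'
        f'<img src="{escape(thumbnail)}" alt="{escape(name)}"></a>'
    )


print(hyperactive_link("caffeine.pdb", "caffeine.gif", "Caffeine"))
```

A browser configured with a helper application for this content type can then hand the coordinates to a molecular viewer when the thumbnail is selected.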
The original contribution can have a so-called Uniform Resource Locator
(URL) associated with it. In effect, this means that a component of the
information originally submitted can in fact reside on the originator's own
server, and what is submitted is simply a pointer to this information. This
enables more comprehensive information associated with the molecule to be made
available, and also permits updates to this information to be made at the
originator's site without impinging on the content of the hyperglossary proper.
Any URLs associated with the molecule are checked by
actually probing the web server containing the supposed document, using libwww-perl.
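The probing step can be illustrated in Python using only the standard library; this is a stand-in for the libwww-perl check, not a transcription of it. The function name `url_is_reachable`, the use of a HEAD request, the restriction to `http` and `https` and the ten-second timeout are all assumptions of the sketch.

```python
import http.client
import urllib.error
import urllib.parse
import urllib.request


def url_is_reachable(url, timeout=10.0):
    """Return True if an HTTP(S) server answers a HEAD request for the URL."""
    scheme = urllib.parse.urlsplit(url).scheme.lower()
    if scheme not in ("http", "https"):
        return False  # only web documents are probed in this sketch
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (ValueError, OSError, http.client.HTTPException):
        # Covers malformed URLs, DNS and connection failures, time-outs,
        # and HTTP error responses such as 404 (HTTPError is an OSError).
        return False
```

Some servers reject HEAD requests, so a fuller check might fall back to GET before declaring the document missing.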
Because requests to the database are created on the fly, up-to-date
information is always available to readers of the hyperglossary.
The actual FORM used to input data is shown below;
Extensive use is made of SMILES strings to contain stereochemical
information, created from 2D ISIS Draw and 3D molecular file formats. The
SMILES format will also allow functional group searches on the database to be
performed. Only limited functional group searching is possible due to the
nature of SMILES, but implementation of SMARTS, available from the
Daylight suite of tools, would allow more complex substructure
searches.
Stereochemistry is calculated from the position of the atoms in space.
This is effective when 3D coordinates are available. If only 2D coordinates are
available, for example in 2D MDL molfiles, an attempt is made to calculate the
stereochemistry from the bond type (wedge or dash) contained in the 2D file. Project
CORINA represents an alternative and reliable way of achieving this goal.
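One common way of deriving a stereo descriptor from 3D coordinates is the sign of a scalar triple product; the sketch below illustrates that idea and is not necessarily the method used by the programs named above. It assumes that the four neighbours of a stereocentre have already been ranked by CIP priority (`p1` highest, `p4` lowest); the function name and tolerance are hypothetical.

```python
def cip_label_from_coords(p1, p2, p3, p4, tolerance=1e-6):
    """Return "R", "S" or None (planar) for four ranked neighbour positions.

    Each argument is an (x, y, z) tuple. The value computed is the scalar
    triple product (p1 - p4) . ((p2 - p4) x (p3 - p4)), i.e. six times the
    signed volume of the tetrahedron spanned by the four points.
    """
    ax, ay, az = (p1[i] - p4[i] for i in range(3))
    bx, by, bz = (p2[i] - p4[i] for i in range(3))
    cx, cy, cz = (p3[i] - p4[i] for i in range(3))
    volume = (
        ax * (by * cz - bz * cy)
        + ay * (bz * cx - bx * cz)
        + az * (bx * cy - by * cx)
    )
    if abs(volume) < tolerance:
        return None  # neighbours are coplanar: no stereochemistry defined
    # Viewed with p4 pointing away, p1 -> p2 -> p3 runs clockwise (R)
    # exactly when the triple product is negative.
    return "R" if volume < 0 else "S"


# Lowest priority at -z (away from a viewer on +z); the others run clockwise.
print(cip_label_from_coords((0, 1, 0), (0.866, -0.5, 0), (-0.866, -0.5, 0), (0, 0, -1)))
```

Swapping any two neighbours changes the sign of the product and therefore the label, which is the expected behaviour for a stereocentre.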
A unique SMILES string can be created to identify each molecule, so that
duplicate entries can be avoided. An automatic check that the molecule
being entered is not already present is always made.
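The duplicate check reduces to a lookup keyed on the unique SMILES, as the sketch below shows. The class `MoleculeRegistry` is hypothetical, and the `canonicalize` argument stands in for whatever routine generates unique SMILES (in our case the Daylight software); the toy lookup table used in the example is not a real canonicalisation algorithm.

```python
class MoleculeRegistry:
    """Stores entries keyed on unique SMILES so duplicates can be detected."""

    def __init__(self, canonicalize):
        # canonicalize: function mapping any valid SMILES to its unique form
        self._canonicalize = canonicalize
        self._entries = {}

    def add(self, smiles, name):
        """Return (True, name) if stored, or (False, existing_name) if duplicate."""
        key = self._canonicalize(smiles)
        if key in self._entries:
            return False, self._entries[key]
        self._entries[key] = name
        return True, name


# Toy stand-in for a real unique-SMILES generator: a fixed lookup table.
TOY_UNIQUE = {"CCO": "CCO", "OCC": "CCO", "C(O)C": "CCO", "CC": "CC"}

registry = MoleculeRegistry(lambda smiles: TOY_UNIQUE[smiles])
print(registry.add("CCO", "ethanol"))        # stored
print(registry.add("OCC", "ethyl alcohol"))  # rejected as a duplicate
```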
If references to, say, ECTOC conference data are given in the form of paper
numbers, then a hyperlink to the paper is automatically added, with the
information contained in the title tags in the header of the paper. In this
sense, the hyperglossary is more than just a simple flat database: it can
contain extensive cross-referencing, thus serving to add an element of
structure to the molecular entities associated with a conference such as
ECTOC.
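The automatic linking can be sketched with a regular expression. In the Python below, the reference pattern ("paper" followed by a number), the URL template `papers/{number}.html` and the `titles` dictionary (assumed to have been read beforehand from the title tags of each paper) are all hypothetical choices made for the illustration.

```python
import re
from html import escape

# Matches references such as "paper 12" or "Paper 7".
PAPER_REF = re.compile(r"\bpaper\s+(\d+)\b", re.IGNORECASE)


def link_paper_refs(text, titles, url_template="papers/{number}.html"):
    """Replace known paper references in plain text with titled hyperlinks."""

    def replace(match):
        number = int(match.group(1))
        title = titles.get(number)
        if title is None:
            return match.group(0)  # unknown paper: leave the text unchanged
        url = url_template.format(number=number)
        return f'<a href="{url}">{match.group(0)}: {escape(title)}</a>'

    # Escape the user's text first; the pattern contains no escapable characters.
    return PAPER_REF.sub(replace, escape(text))


print(link_paper_refs("See paper 12 and paper 99.", {12: "A Molecular Hyperglossary"}))
```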
The contents of the hyperglossary can be viewed thus;
It is important to realise that this display is generated directly from the
database, and does not necessarily have to be maintained by e.g. a conference
editor. Presently it is possible to search for specific text strings in the
hyperglossary, including fields such as formula, name or SMILES descriptor.
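A text-string search of this kind can be sketched as a substring match over selected fields. The entry layout (a list of dictionaries) and the function name are assumptions of the sketch. Names are compared without regard to case, whereas formula and SMILES are compared exactly, because case is significant in SMILES (lower-case letters denote aromatic atoms).

```python
CASE_INSENSITIVE_FIELDS = {"name"}


def search_entries(entries, query, fields=("name", "formula", "smiles")):
    """Return the entries in which any of the given fields contains the query."""
    hits = []
    for entry in entries:
        for field in fields:
            value = entry.get(field, "")
            if field in CASE_INSENSITIVE_FIELDS:
                found = query.lower() in value.lower()
            else:
                found = query in value
            if found:
                hits.append(entry)
                break  # report each entry once, however many fields match
    return hits


glossary = [
    {"name": "Ethanol", "formula": "C2H6O", "smiles": "CCO"},
    {"name": "Benzene", "formula": "C6H6", "smiles": "c1ccccc1"},
    {"name": "Cyclohexane", "formula": "C6H12", "smiles": "C1CCCCC1"},
]
print([e["name"] for e in search_entries(glossary, "c1ccccc1")])
```

Note that matching a SMILES fragment as text is not a substructure search; it only finds entries whose stored string happens to contain the fragment.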
If a search is successful, the results are presented in a table form;
In introducing the concept of an on-line molecular hyperglossary, it was our
intention to create a collaborative environment where bibliographic and 2D/3D
molecular data could be exchanged between the participants of a conference.
Some degree of quality control can be automated, and such a database can serve
to impart a degree of uniformity of presentation and structure to such an
event. The scope of such a hyperglossary could range from local use by a
research group or department, through a component of an electronic conference (as
here), to an adjunct to an electronic journal. The entire collection of
information is readily indexed, and via e.g. the SMILES string is even
searchable in a sub-structure context.
Future developments of the hyperglossary will include the introduction of TGF reaction files to allow organic
chemists to draw a complete reaction scheme and submit it to the hyperglossary.
The hyperglossary will store the reaction scheme and create a 2D picture of the
scheme automatically. The structures within the reaction could be separated and
used to create 3D pictures and their corresponding 3D molecular information for
viewing in 3D viewers. In due course, we anticipate that commercial
implementations of these concepts will become available.
We are grateful to Pat Walters, author of Babel, for helpful
discussions, and to Daylight Chemical
Information Systems for their generosity in allowing us access to their program
libraries.
Revised 4/08/95