A New Publishing Paradigm: STM Articles as part of the
Semantic Web
Henry S. Rzepa (a) and Peter Murray-Rust (b)
(a) Department of Chemistry, Imperial College, London, SW7 2AY.
(b) School of Pharmacy, University of Nottingham.
Abstract
An argument is presented for replacing the traditional
published scientific, technical or medical article, in which
heavy reliance is placed on human perception of a printed or
printable medium, with a more data-centric model whose
well-defined structures are expressed using XML languages. Such
articles can be regarded as self-defining or "intelligent":
they can be scaled down to finely grained detail such as
individual atoms in molecules, or up into journals and
higher-order collections, and they can be used by software
agents as well as by humans. Our vision is that this
higher-order concept, often referred to as the Semantic Web,
would lay the foundation for the creation of an open and global
knowledge base.
Introduction
Both scientists and publishers would agree that the
processes involved in publishing and particularly of reading
scientific articles have changed considerably over the last
five years or so. We will argue, however, that these changes
relate predominantly to the technical processes of publication
and delivery, and that fundamentally most authors' and readers'
concepts of what a learned paper is, and how it can be used,
remain rooted in the medium rather than the message. The
ubiquitous use of the term "reading a paper" implies a
perceptive activity that only a human can easily perform,
especially if complex visual and symbolic representations are
included in the paper. We suggest that the learned article
should instead be regarded as more of a functional tool, to be used with
the appropriate combination of software based processing and
transformation of its content, with the human providing
comprehension and enhancement of the knowledge represented by
the article.
The Current Publishing Processes
Much of the current debate about learned STM (Scientific,
Technical and Medical) publishing centres around how the
dynamics of authoring an article might involve review, comment
and annotation by peer groups, i.e. the pre-print/self-print
experiments and discussion forums [1]. This leads on to
the very role of publishers themselves, and the "added-value"
that they can bring to the publication process, involving
concepts such as the aggregation of articles and journals, ever
richer contextual full-text searching of articles, and added
hyperlinking both between articles and between the article and
databases of subject content. These are all layers added by the
publishers and, inevitably, since they often involve human
perception at some or all of these stages, they remain
expensive additions to the publishing process. There is also
the implicit assumption that the concept of what represents
added value is largely defined by the publishers rather than by
the authors and readers.
These debates largely assume that the intrinsic structure of
the "article" remains very much what an author or reader from
the 19th century might have been familiar with. These
structures are mostly associated with what can be described as
the "look and feel" of the journal and its articles, namely the
manner in which the logical content created by the author is
presented on the printed page (or the electronic equivalent of
the printed page, the Acrobat file). In our own area of
molecular sciences, the content is serialised onto the printed
page or Acrobat equivalent into sequential sections such as
abstract, introduction, numerical and descriptive results, a
section showing schematic representations of any new molecules
reported, a discussion relating perhaps to components (atoms,
bonds etc) of the molecules and a bibliography. A human being
can scan this serialised content and rapidly perceive its
structure and more or less accurately infer the meaning of e.g.
the schematic drawing of a molecule (although perceiving the
three dimensional structure of such a molecule from a paper
rendition is much more of a challenge!). A human is less well
suited to scanning thousands, if not millions, of such articles
in an error-free manner, and is subject to the error-prone
process of transcribing numerical data from paper. Changing the
medium of the article from paper to an Acrobat file does little
to change this process. Most people probably end up printing
the Acrobat file; few would confess to liking to read it on the
computer screen. Yet, this remains the process that virtually
everyone "using" modern electronic journals would go
through.
We argue here that data must be regarded as a critically
important part of the publication process, with documents and
data being part of a seamless spectrum. In many disciplines the
data are critical for the full use of the "article". To achieve
such seamless integration, the data content of an article must
be expressed in a far more precise way than is currently
achieved: precise enough to be not merely human-perceivable
but, where necessary, machine-processable. The concept is
summarised by the term "Semantic Web", used by
Berners-Lee [2] to express his vision of how the
World-Wide Web will evolve to support the exchange of
knowledge. The Semantic Web by its nature includes
the entire publishing process, and we feel that everyone
involved in the publishing process will come to recognise that
this concept really does represent a paradigm shift in the
communication and in particular the use of information and
data. A concept central to the Semantic Web is that data must
be self-defining, such that decisions about what it represents,
and about how it can be acted upon or transformed, can be made
not merely by humans but by software agents created by
humans for the purpose. The concept also includes some measure
of error checking where the structure and associated meaning
(ontology) of the data are available, and mechanisms to avoid
loss of data where the meaning is not sufficiently well known at
any stage.
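To make the notion of self-defining data concrete, the following minimal Python sketch parses a small XML fragment in which a measured quantity carries an explicit units attribute. The element names, the namespace URI and the value are invented for illustration only, but they show how a software agent could act on such data (here by converting units) rather than merely render it for human perception.

```python
import xml.etree.ElementTree as ET

# A hypothetical self-describing fragment: the element names, the namespace
# URI and the numerical value are invented for illustration only.
FRAGMENT = """<experiment xmlns:prop="http://example.org/ns/property">
  <prop:meltingPoint units="celsius">217.5</prop:meltingPoint>
</experiment>"""

NS = {"prop": "http://example.org/ns/property"}

root = ET.fromstring(FRAGMENT)
mp = root.find("prop:meltingPoint", NS)

# Because the value and its units are declared explicitly, a software agent
# can transform the data (a simple unit conversion here) without any
# human perception of a printed page.
value_c = float(mp.text)
if mp.get("units") == "celsius":
    print(f"melting point: {value_c} C = {value_c + 273.15} K")
```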
The stages in the evolution of data and knowledge are part
of the well-known scientific cycle. An example from the
molecular and medicinal sciences might serve to illustrate the
current process:
- A human decides a particular molecular sub-structure is
of interest, on the basis of reading a journal article
reporting one or more whole molecular structures and their
biological properties relating to e.g. inhibition of
cancerous growth. This process is currently almost entirely
dependent on human perception.
- A search of various custom molecular databases is
conducted, using a manual transcription of the relevant
molecular structure. This implies a fair degree of knowledge
by the human about the representational meaning of the
structure they have perceived in the journal article.
Chemists tend to use highly symbolic representations of
molecules, ranging from text-based complex nomenclature to
even more abstract 2D line diagrams in which many of the
components present are implied rather than declared (a sketch
illustrating such implicit information follows this list). Licenses
to access the databases must be available, since most
molecular databases are proprietary and closed. It is quite
probable that a degree of training of the human to use each
proprietary interface to these databases will be
required.
- It is becoming more common for both primary and secondary
publishers to integrate steps 1 and 2 into a single "added
value" environment. This environment is inevitably expensive,
because it was created largely by human perception of the
original published journal articles. In effect, although the
added service is indeed valuable, the processes involved in
creating it merely represent an aggregation of what the human
starting the process would have done anyway.
- The result of the search may be a methodology for
creating new variants of the original molecule (referred to
by chemists as the "synthesis" of the molecule). The starting
materials for conducting the synthesis have to be sourced
from a supplier, and ordered by raising purchase orders through
an accounts officer.
- Nowadays, it is perfectly conceivable that a
"combinatorial" instrument or machine will need to be
programmed by the human to conduct the synthesis.
- The products of the synthesis are then analysed using
other instruments, and the results interpreted in terms of
both purity and the molecular structure. This can often
nowadays be done automatically by software agents. A
comparison of the results with previously published and
related data is often desirable.
- Biological properties of the new species can be screened,
again often automatically using instrumentation and software
agents.
- The data from all these processes are then gathered, edited
by the human, and (nowadays at least) transcribed into a word
processing program in which the document structures imposed
are those of the journal's "guidelines for authors" rather than
those implied by the molecular data itself. We emphasise that
this step in particular is a very lossy process, i.e. lack of
appropriate data structures will mean loss of data!
- More often than not, the document is then printed and
sent to referees. The data from steps 1-7 above are accessible
to them only if they invoke their own human perception,
since the process involved in step 8 may adhere (and then
often only loosely) merely to the journal publishing and
presentational guidelines rather than to those associated
with the data harvested from steps 1-7.
- The article is finally published, the full text indexed,
and the bibliography possibly hyperlinked to the other
articles cited (in a monodirectional sense). The important
term here is of course "full text". In a scientific context
at least, and certainly in the molecular sciences, the
prose-based textual description of the meaning inevitably
carries only part of the knowledge and information
accumulated during the preceding steps. Full-text prose is
inevitably a lossy carrier of data and information. Even
contextual operators invoked during a search (is A adjacent
to B? Does A come before B?) recover only a proportion of the
original data and meaning. The rest must be accomplished by
humans as part of the secondary publishing process, and of
course the cycle now completes with a return to step 1.
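As promised in step 2 above, the following toy Python sketch illustrates how much of a chemist's symbolic representation is implied rather than declared: given only the heavy atoms and bonds that a 2D line diagram depicts, the hydrogens must be reconstructed from valence rules. The valence table and the acetic acid example are deliberate simplifications for illustration, not a real cheminformatics routine.

```python
# Toy reconstruction of implicit hydrogens from a line-diagram-style
# representation in which only heavy atoms and bonds are declared.
# The valence values are simplified assumptions; real perception software
# must also handle charges, aromaticity and many other subtleties.

TYPICAL_VALENCE = {"C": 4, "N": 3, "O": 2}

# Heavy atoms and bonds (atom index pairs with bond order) of acetic acid,
# CH3-COOH, as a chemist's sketch might imply them.
atoms = ["C", "C", "O", "O"]
bonds = [(0, 1, 1), (1, 2, 2), (1, 3, 1)]

def implicit_hydrogens(atoms, bonds):
    """Infer the number of undeclared hydrogens on each heavy atom."""
    used = [0] * len(atoms)
    for i, j, order in bonds:
        used[i] += order
        used[j] += order
    return [TYPICAL_VALENCE[el] - u for el, u in zip(atoms, used)]

print(implicit_hydrogens(atoms, bonds))   # [3, 0, 0, 1] -> CH3, C, =O, OH
```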
The cycle described above is clearly lossy. Much of the
error correction, contextualisation and perception must be done
by humans; we argue that too much of it is (we certainly do not
argue for eliminating the human from the cycle entirely!).
Learned Articles as part of a Semantic Web
It is remarkable how many of the 10 steps described above
have the potential for the symbiotic involvement of software
agents and humans. If the structures of the data passed between
any two stages in the above process, and the resulting actions,
could be mutually agreed, then significant automation becomes
possible and, more importantly, data or its context need not be
lost or marooned during the process. This very philosophy is at
the heart of the development and adoption of XML (extensible
markup language) [3] as one mechanism for implementing
the Semantic Web, together with the other vital concept of meta
data, which serves to describe the context and meaning of data.
XML is a precise set of guidelines for writing any extensible
markup language, together with a set of generic tools for
manipulating and transforming the content expressed using such
languages. Many such markup languages already exist and are in
use; examples include XHTML (for carrying prose descriptions in
a precise and formal manner), MathML (for describing
mathematical symbolisms) [4], SVG and PlotML (for expressing
numerical data as two-dimensional diagrams and
charts) [5], and CML (chemical markup
language) [6] for expressing the properties and
structures of collections of molecules.
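To contrast such markup with prose, the short Python sketch below parses a molecule record written in the general style of CML (the exact element names and the omission of a namespace are simplifying assumptions) and answers, without any human perception, questions that a full-text description could support only unreliably.

```python
import xml.etree.ElementTree as ET

# A simplified molecule record in the general style of CML; the exact
# element names and the lack of a namespace are assumptions made to keep
# the illustration short.
MOLECULE = """<molecule id="m1" title="water">
  <atomArray>
    <atom id="a1" elementType="O"/>
    <atom id="a2" elementType="H"/>
    <atom id="a3" elementType="H"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
    <bond atomRefs2="a1 a3" order="1"/>
  </bondArray>
</molecule>"""

mol = ET.fromstring(MOLECULE)
elements = [atom.get("elementType") for atom in mol.iter("atom")]
bond_count = len(list(mol.iter("bond")))

# Which elements are present, and how many bonds are declared?
print(mol.get("title"), elements, bond_count)   # water ['O', 'H', 'H'] 2
```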
We have described in technical detail elsewhere [7]
how we have authored, published and subsequently re-used an
article written entirely in XML languages, and so confine
ourselves here to how such an approach has the potential to
change some if not all of the processes described in steps 1-10
above. Molecular concepts such as molecule structures and
properties were captured using CML, schematic diagrams were
deployed as SVG, the prose was written in XHTML, the article
structure and bibliography were written in DocML, meta data was
captured as RDF (resource description framework) [8],
and the authenticity, integrity and structural validity of the
article and its various components were verified using XSIGN
digital signatures. All these components inter-operate
with each other, and can be subject to generic tools such as
XSLT (transformations) to convert the data into the context
required or CSS (stylesheets) to present the content in e.g. a
browser window. The semantics of each XML component can be
machine-verified using documents known as DTDs (document type
definitions) or Schemas, and where necessary components of the
article (which could be as small or finely grained as
individual atoms or bonds) can be identified using a
combination of namespaces and identifiers.
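As an illustration of the generic tooling mentioned here, the following Python sketch uses the lxml library (assumed to be available) to apply an XSLT stylesheet to an XML document. Both the input document and the stylesheet are invented for illustration; the transform simply renders molecule titles as an XHTML list.

```python
from lxml import etree

# Both documents are invented for illustration; a real article would carry
# CML, XHTML, RDF and other namespaces side by side.
ARTICLE = """<article>
  <molecule title="aspirin"/>
  <molecule title="paracetamol"/>
</article>"""

# A minimal XSLT stylesheet rendering molecule titles as an XHTML list.
STYLESHEET = """<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/article">
    <ul>
      <xsl:for-each select="molecule">
        <li><xsl:value-of select="@title"/></li>
      </xsl:for-each>
    </ul>
  </xsl:template>
</xsl:stylesheet>"""

transform = etree.XSLT(etree.fromstring(STYLESHEET))
result = transform(etree.fromstring(ARTICLE))
print(str(result))   # prints an XHTML <ul> list of the molecule titles
```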
The most important new concept that emerges from the use of
XML is that the boundaries of what would conventionally be
thought of as a "paper" or "article" can be scaled both up and
down. Thus as noted above, an article could be disassembled
down to an individual marked up component such as one atom in a
molecule, or instead aggregated into a journal, collection of
journals, or ultimately into the semantic web! This need not
mean loss of identity, or provenance, since in theory at least,
each unit of information can be associated with meta data
indicating its originator, and if required a digital signature
confirming its provenance. Because at the heart of XML lies
the concept that the form or style of presentation of data is
completely separated from its content, the "look-and-feel"
of the presentation can be applied at any scale (arguably for
an individual atom, certainly for an aggregation such as a
journal, and potentially for the entire semantic web). It is
worth now reanalysing the ten steps described above, but in a
context in which everything is expressed as XML structures.
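First, though, a brief sketch of the scaling-down idea: the Python fragment below addresses a single atom within a larger document by its identifier and reads provenance metadata attached alongside it. The namespaces shown and the manner of attaching a dc:creator annotation are illustrative assumptions rather than a prescription.

```python
import xml.etree.ElementTree as ET

# An illustrative compound document: molecule data in a CML-style namespace,
# provenance in Dublin Core. The structure is an assumption about how
# per-component metadata might be attached, not a published schema.
DOC = """<entry xmlns:cml="http://www.xml-cml.org/schema"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
  <cml:molecule id="m1">
    <cml:atom id="m1_a7" elementType="N"/>
  </cml:molecule>
  <dc:creator about="m1_a7">A. N. Author</dc:creator>
</entry>"""

NS = {"cml": "http://www.xml-cml.org/schema",
      "dc": "http://purl.org/dc/elements/1.1/"}

root = ET.fromstring(DOC)

# Scale down: address one atom, the finest grain of the article.
atom = root.find(".//cml:atom[@id='m1_a7']", NS)

# The provenance travels with the fragment rather than with the "paper".
creator = root.find(".//dc:creator[@about='m1_a7']", NS)
print(atom.get("elementType"), creator.text)   # N A. N. Author
```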
- A human or software agent acting on their behalf can
interrogate an XML-based journal, asking questions such as
"how many molecules are reported containing a particular
molecular fragment with associated biological data relating
to cancer?". This would, technically, involve software
searching for the CML or related "namespaces" to find
molecules, and checking any occurrences for particular
patterns of atoms and bonds (a minimal sketch of such a
namespace-based search follows this list). We have indeed
demonstrated a very similar process for our own XML-based
journal articles; the issue is really only one of scale. Any citations
retrieved during this process are captured into the XML-based
project document along with relevant information such as
CML-based descriptors.
- Any retrieved molecules can now be edited or filtered by
the human (or software agent) and presented to specialised
databases for further searching (if necessary preceded by the
appropriate transformation of the molecule to accommodate any
non-standard or proprietary representations required by that
database) and any retrieved entries again formulated in
XML.
- With publishers receiving all journal articles in XML
forms, the cost of validating, aggregating, and adding value
to the content is now potentially much smaller. The publisher
can concentrate on higher forms of added value; for example
contracting to create similarity indices for various
components, or computing additional molecular
properties.
- Other XML-based sources of secondary published
information such as "Organic Syntheses" or "Science of
Synthesis" (both of which actually happen to be already
available at least partially in XML form) can be used to
locate potential synthetic methods for the required molecule.
The resulting methodology is again returned in XML form. At
this stage, purchasing decisions based on identified
commercial availability of appropriate chemicals can be made,
again with the help of software agents linking to e-commerce
systems. Many new e-commerce systems are themselves based on
XML architectures.
- The appropriate instructions, in XML-form, can be passed
to a combinatorial robot.
- Processing instructions for instruments can be derived
from the XML formulation, and the results similarly returned,
or passed to software for heuristic (rule based)
interpretation or checking. The software itself will have an
authentication and provenance that could be automatically
checked, if necessary by resolution back to a journal article
and its XML-identified authorship. We also note at this stage
that the original molecular fragment identified in step 1 is
still part of the data, although obviously subjected to very
substantial annotation with each step, the provenance of
which can be verified if necessary.
- The compound along with its accreted XML description can
now be passed to biological screening systems, which can
extract the relevant information and return the results in
the same form.
- At this stage, much human thought will be needed to make
intelligent sense of the accumulated results. To help in this
process, the XML document describing the entire project can
always be re-presented to the human through appropriately
selective filters and transforms, which may include statistical
analysis or computational modelling. The human can annotate
the document with appropriate prose, taking care to link
technical terms to an appropriate dictionary or glossary of
such terms so that other humans or agents can make the
ontological associations.
- Any referees of the subsequent article (whether open in a
pre-print stage, or closed in the conventional manner) will
now have access not only to the annotated prose created by
the author in the previous stage, but potentially to the more
important data accreted by the document in the previous
stages. Their ability to perform their task can only be
enhanced by having such access.
- The article is published. The publisher may choose to add
additional value to any of the components of the article,
depending on their speciality. They may also make the article
available for annotation by others.
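The namespace-based search mentioned in step 1 might, in a deliberately minimal form, resemble the Python sketch below. The toy corpus, the namespace handling and the "fragment" test (reduced here to a required set of element types) are all simplifying assumptions standing in for a genuine substructure search; the point is only that the molecules are found by their markup, not by parsing prose.

```python
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org/schema"   # treated as an assumption here

# A toy "journal": two article fragments carrying molecules in a CML-like
# namespace alongside ordinary prose markup.
ARTICLES = [
    """<article xmlns:cml="{ns}">
         <p>We report a new nitrogen heterocycle...</p>
         <cml:molecule id="m1">
           <cml:atom elementType="C"/><cml:atom elementType="N"/>
         </cml:molecule>
       </article>""".format(ns=CML_NS),
    """<article xmlns:cml="{ns}">
         <p>A purely carbocyclic system...</p>
         <cml:molecule id="m2">
           <cml:atom elementType="C"/><cml:atom elementType="C"/>
         </cml:molecule>
       </article>""".format(ns=CML_NS),
]

def molecules_containing(articles, required_elements):
    """Yield ids of molecules whose atoms include the required element
    types -- a crude stand-in for a real substructure match."""
    ns = {"cml": CML_NS}
    for xml_text in articles:
        root = ET.fromstring(xml_text)
        for mol in root.findall(".//cml:molecule", ns):
            present = {a.get("elementType")
                       for a in mol.findall("cml:atom", ns)}
            if required_elements <= present:
                yield mol.get("id")

print(list(molecules_containing(ARTICLES, {"C", "N"})))   # ['m1']
```

Scaling the same idea from two fragments to a journal archive is, as noted in step 1, chiefly a question of infrastructure rather than of principle.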
This revised cycle is, potentially at least, far less lossy than
the conventional route. Of course, some loss of data is
probably desirable, since otherwise the article would become
over-burdened by superseded data. The issue of how much editing
is appropriate within such a model is one that the community
(and commercial reality) will decide.
Conclusions
The Semantic Web is far more than just one particular
instance of how the scientific discovery and publishing process
could be implemented. It involves a recognition by humans of
the importance of retaining the structure of data at all stages
in the discovery process. It involves them recognising the need
for inter-operability of data in the appropriate context, and
ultimately agreeing on common ontologies for the meanings of
terms in their own subject areas. At the heart of this model will be
the creation of an open model of publishing, which will lay the
foundation for the creation of a global knowledge base in a
particular discipline. The seamless aggregation of published
"articles" will be the foundation of such a knowledge base.
These are grand challenges which may take a while to
achieve. The technical problems are relatively close to
solution, although the business models may not be! The
greatest challenge, however, will be convincing authors and
readers in the scientific communities to rethink their concept
of what the publishing process is, to think instead on a global
scale, and to change the way they work, capture and pass on
data and information to the global community.
Citations and References
1. Harnad, S., Nature, 1999, 401 (6752), 423.
The topic is currently being debated on forums such as the
Nature debates, http://www.nature.com/nature/debates/e-access/index.html,
the American Scientist Forum, http://amsci-forum.amsci.org/archives/september98-forum.html,
and chemistry pre-print sites such as http://preprint.chemweb.com/.
Other interesting points of view are represented by Bachrach,
S. M., "The 21st century chemistry journal", Quim. Nova,
1999, 22, 273-276; Kircz, J., "New practices for
electronic publishing: quality and integrity in a multimedia
environment", UNESCO-ICSU Conference Electronic Publishing in
Science, 2001.
2. Berners-Lee, T., Hendler, J. and Lassila, O., http://www.scientificamerican.com/2001/0501issue/0501berners-lee.html;
Berners-Lee, T. and Fischetti, M., "Weaving the Web: The
Original Design and the Ultimate Destiny of the World-Wide
Web", Orion Business Books, London, 1999, ISBN
0752820907.
3. The definitive source of information about XML projects
is available at the World-Wide Web Consortium site, http://www.w3c.org/
4. See http://www.w3c.org/Math/
5. SVG, see http://www.w3c.org/Graphics/SVG/; PlotML, see
http://ptolemy.eecs.berkeley.edu/ptolemyII/ptII1.0/
6. Murray-Rust, P. and Rzepa, H. S., J. Chem. Inf. Comp.
Sci., 1999, 39, 928, and articles cited therein.
See http://www.xml-cml.org/
7. Murray-Rust, P., Rzepa, H. S., Wright, M. and Zara, S., "A
Universal Approach to Web-based Chemistry using XML and CML",
Chem. Commun., 2000, 1471-1472; Murray-Rust, P., Rzepa, H.
S. and Wright, M., "Development of Chemical Markup Language (CML)
as a System for Handling Complex Chemical Content", New J.
Chem., 2001, 618-634. The full XML-based article can be
seen at http://www.rsc.org/suppdata/NJ/B0/B008780G/index.sht
8. The RDF specifications provide a lightweight ontology
system to support the exchange of knowledge on the Web; see
http://www.w3c.org/RDF/